Teaching Robots to Think in 3D

Core Components

Image Encoder

A critical component responsible for perceiving the environment and extracting features from the current state, such as object types, positions, and other attributes. It serves as the "eyes" of the system, extracting scene features that are subsequently used for reasoning and decision-making. Contemporary VLA systems employ transformer-based image encoder architectures.

ViT (CLIP, SigLIP, DINOv2)

Vision Transformer (ViT) is a computer vision architecture that adapts the self-attention mechanism from NLP tasks. The concept is straightforward: ViT divides an image into small patches and projects each into a vector representation (embeddings). These embeddings are augmented with positional encoding before being fed into a transformer encoder.
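The patching step can be illustrated with a minimal NumPy sketch. It is an assumption-laden toy, not the code of any specific ViT implementation: the function name `patchify`, the 16-pixel patch size, the 64-dimensional embedding, and the random projection and positional weights are all chosen here for illustration (in a real model the weights are learned).

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # stand-in for a camera frame
patch_size, embed_dim = 16, 64         # illustrative sizes

patches = patchify(image, patch_size)  # one row per patch
# Random stand-ins for the learned projection and positional embeddings.
w_proj = rng.normal(0.0, 0.02, (patches.shape[1], embed_dim))
pos_embed = rng.normal(0.0, 0.02, (patches.shape[0], embed_dim))
tokens = patches @ w_proj + pos_embed  # sequence fed to the transformer encoder
```

With these sizes a 224×224 image becomes a sequence of 196 tokens, which is the sequence the transformer encoder attends over.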

Unlike conventional convolutional networks, this approach analyzes not only local fragments but also captures relationships between different image regions, enabling the formation of more holistic representations and comprehensive scene understanding.

This encoder architecture serves as the foundation for many subsequent transformer-based models, including CLIP, DINO, SAM, and others.

Let's briefly examine several of them:

CLIP. A multimodal architecture capable of understanding the relationship between text and images. CLIP enables joint semantic representation of text and images in a unified embedding space, allowing the model to process both data types concurrently. It comprises a ViT-based image encoder and a text encoder (GPT-based). Through contrastive learning, where text-image pairs are aligned, and leveraging large-scale datasets, the model develops diverse representations and a general understanding of text-image relationships. The main drawback is the lack of pixel-level granularity (e.g., precise segmentation masks or depth estimation) required for high-precision manipulation.
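A rough sketch of the contrastive objective described above, written with NumPy. The function name `clip_style_loss` and the fixed temperature of 0.07 are assumptions for illustration (CLIP treats the temperature as a learned parameter), and random vectors stand in for encoder outputs.

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of each input is assumed to be a matching pair."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) cosine similarities, scaled

    def diagonal_cross_entropy(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
loss_aligned = clip_style_loss(emb, emb)                         # pairs match
loss_shuffled = clip_style_loss(emb, np.roll(emb, 1, axis=0))    # pairs misaligned
```

The loss is low when each image embedding is closest to its own caption and high when the pairs are misaligned, which is what pulls matching text and images together in the shared space.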

SigLIP 2. A successor to CLIP-style training; the SigLIP family replaces CLIP's softmax contrastive loss with a pairwise sigmoid loss. Quality improvements in SigLIP 2 were achieved through SSL methods, particularly self-distillation during training. The student model sees only a local image crop, while the teacher model sees the global crop. Thus, the model learns to derive the same representations from details as from the entire image. For instance, seeing a dog's nose, the model mentally "reconstructs" the whole dog. Additional techniques include masked patch reconstruction, as in Masked Autoencoders, and generating bounding boxes with captions during training to teach the model to associate image details with text.

SigLIP has become a popular choice for recent open-source models such as OpenVLA and π₀ (Pi-Zero).

DINOv2. Employs a self-supervised self-distillation paradigm, enabling the extraction of highly robust visual features. Unlike CLIP, DINOv2 excels at extracting low-level spatial details and understanding scene geometry, which is critical for physical robot-object interaction. For example, the OpenVLA model uses DINOv2 alongside SigLIP to combine deep semantic understanding with high spatial precision.

Brief Summary:

The contemporary trend involves distilling multiple models (e.g., CLIP, DINOv2, and SAM) into a single architecture or their simultaneous utilization (fusion), as implemented in OpenVLA. This enables the model to simultaneously comprehend "what" lies before it (via CLIP/SigLIP) and "where" it is located (via DINOv2).

3D/Point Cloud encoder

Since robots operate in three-dimensional space, incorporating 3D data enables better accounting for object geometry, poses, and environmental physical constraints. Point clouds are a popular choice for 3D representation, as they are easily extracted from RGB-D camera and LiDAR data.

3D/Point Cloud Encoders are neural network or algorithmic models that transform 3D data into compact, informative representations suitable for downstream tasks.

Point Cloud is a set of points in 3D space, where each point typically has coordinates (x, y, z) and sometimes additional features such as color (RGB), reflected light intensity, surface normals, or even density. Unlike 2D images that lie on a regular pixel grid, points in a cloud are arbitrarily scattered, their quantity may vary, and the distribution is irregular and sparse. The primary objective of an encoder is to convert the point set into a fixed-size vector representation (embedding) that preserves the geometric and structural information of the object.

For instance, point clouds are employed in approaches such as:

DP3 (3D Diffusion Policy): Action module. Utilizes compact 3D representations for learning visuomotor control strategies, providing superior generalization capability. Specifically, the input consists of a scene representation as a point cloud, robot state, goal (e.g., object mask in the point cloud), and DP3 determines how to accomplish this within the given geometry — what actions to take.

LEO: A model integrating a point cloud encoder with a large language model (LLM), enabling the robot to comprehend the 3D world (object geometry, position, orientation, and language embeddings describing what the object is) and plan tasks. It bridges language ↔ 3D geometry ↔ actions. LEO is frequently employed as a bridge between VLM/LLM and action modules such as DP3, transforming the scene not into a raw point cloud, but into a set of meaningful 3D entities.

In brief:

LEO = Object-centric 3D representation + Language embedding

Typical pipeline:

Cameras / depth → 3D scene reconstruction

Scene segmentation → objects are extracted

For each object, its 3D shape and image crop/view are processed through a VL encoder

Results in a LEO set: LEO₁ = "red mug", LEO₂ = "cardboard box", LEO₃ = "table"

3D-VLA: A class of systems where language defines the goal, 3D scene representation (point cloud, NeRF, TSDF, etc.) provides understanding of the physical world, and the Action module (often a diffusion policy) generates actual robot movements.

Example:

VL model locates the mug and box

3D scene representation is constructed

Target region inside the box is determined

DP3 receives: scene point cloud, mug point cloud, target pose

DP3 generates trajectory: grasp, transport, release

Several popular approaches exist for encoding 3D points:

PointNet and PointNet++

Architectures that operate on unordered 3D points and are invariant to their permutations. Essentially, they seek a permutation-invariant function.

PointNet applies a small MLP (multi-layer perceptron) to each point individually, then aggregates results through a symmetric function (typically max pooling due to its invariance property) to obtain a global object vector.
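The shared-MLP-plus-max-pooling idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the two-layer MLP, the layer widths, and the random weights are hypothetical, and the input and feature transform networks of the full PointNet are omitted.

```python
import numpy as np

def pointnet_global_feature(points, w1, b1, w2, b2):
    """Apply the same small MLP to every point, then max-pool over points."""
    hidden = np.maximum(points @ w1 + b1, 0.0)     # shared layer 1 with ReLU
    per_point = np.maximum(hidden @ w2 + b2, 0.0)  # shared layer 2 with ReLU
    return per_point.max(axis=0)                   # symmetric aggregation

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 32)), np.zeros(32)     # illustrative random weights
w2, b2 = rng.normal(size=(32, 128)), np.zeros(128)

cloud = rng.normal(size=(500, 3))                   # 500 points with (x, y, z)
feat = pointnet_global_feature(cloud, w1, b1, w2, b2)

# Reordering the points must not change the global feature.
shuffled = cloud[rng.permutation(len(cloud))]
feat_shuffled = pointnet_global_feature(shuffled, w1, b1, w2, b2)
```

Because the maximum over points does not depend on their order, `feat` and `feat_shuffled` coincide, which is the permutation invariance the text refers to.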

However, PointNet struggled with capturing local structure, leading to a modified model. PointNet++ introduces hierarchical processing: first, local point clusters (subsets of points) are processed separately, then results are merged, enabling better capture of local structures.

PointNet / PointNet++ are frequently employed as scene encoders, object encoders (LEO), and 3D encoders.

Voxel-based

3D space is divided into a voxel grid (3D analog of pixels). 3D convolutional networks are then applied for feature extraction. However, discretizing onto a fixed-resolution grid loses fine geometric detail, and memory cost grows cubically with resolution.
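A minimal occupancy-grid sketch of voxelization, assuming NumPy and an axis-aligned workspace. The function name `voxelize`, the unit-cube bounds, and the resolution of 10 are illustrative choices, not parameters of any particular system.

```python
import numpy as np

def voxelize(points, bounds_min, bounds_max, resolution):
    """Mark each voxel of a cubic grid that contains at least one point."""
    points = np.asarray(points, dtype=float)
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    voxel_size = (bounds_max - bounds_min) / resolution
    idx = np.floor((points - bounds_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < resolution), axis=1)  # drop out-of-bounds points
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[tuple(idx[inside].T)] = True
    return grid

points = np.array([
    [0.05, 0.05, 0.05],
    [0.06, 0.04, 0.05],   # lands in the same voxel as the first point
    [0.95, 0.55, 0.55],
    [1.50, 0.00, 0.00],   # outside the workspace, ignored
])
grid = voxelize(points, (0, 0, 0), (1, 1, 1), resolution=10)
```

The first two points collapse into a single occupied voxel, a small illustration of how detail below the voxel size disappears.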

One example is PerAct, an end-to-end Transformer agent for robotic manipulation that unifies 3D scene perception, language understanding, and generation of six-degree-of-freedom (6-DoF) actions.

Rather than constructing the scene through object detection → pose estimation or deriving policy from 2D images, PerAct operates as follows:

Voxelizes the scene in 3D — i.e., discretizes space into cubes (voxels), populating them with RGB-D data. This yields a structured representation of scene geometry (object shapes and positions).
Encodes language (instruction) through a language model to understand what exactly needs to be done.
Fuses voxelized 3D information with the goal embedding and processes them through a Perceiver Transformer — an architecture capable of efficiently handling extremely long data sequences (up to millions of voxels).
The model outputs a discretized action in 3D — end-effector (manipulator) position and orientation, gripper state (open/closed), and parameters for the motion planner. This is achieved through "next-best-voxel-action detection."

GNN for Points

Here, points are treated as graph vertices, with edges connecting nearest neighbors. GNNs can capture local relationships between points, improving representation quality for complex geometry.
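As a rough sketch of the graph construction, the snippet below builds a k-nearest-neighbor graph with NumPy and performs one round of neighbor averaging. The names `knn_neighbors` and `mean_aggregate` are hypothetical, the brute-force distance matrix is only suitable for small clouds, and a real GNN layer would apply learned transformations rather than a plain mean.

```python
import numpy as np

def knn_neighbors(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def mean_aggregate(features, neighbors):
    """One simplified message-passing step: average the neighbors' features."""
    return features[neighbors].mean(axis=1)

points = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [2.5, 0.0, 0.0],
    [10.0, 0.0, 0.0],
])
neighbors = knn_neighbors(points, k=1)
aggregated = mean_aggregate(points, neighbors)  # here coordinates double as features
```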

Transformers (Point Transformer)

The idea is to apply self-attention to points, allowing each point to "attend" to other points and learn both global and local relationships.

Contrastive and Self-Supervised Encoders

The concept is identical to that for 2D images. The encoder learns such that identical objects under different viewpoints or with noise produce similar embeddings, while different objects yield dissimilar ones. This makes the representation more universal.

Mini Summary

Advantages and Challenges: While 3D encoders provide superior geometric accuracy, their real-time integration is often constrained by computational complexity and the cost of 3D data annotation.

Text encoder

LLM (LLama, Vicuna, PaLM)
T5-base
GPT-based models
Qwen-based models

Action Decoder

Responsible for converting abstract multimodal representations into concrete robot control commands. While visual and language encoders handle perception and understanding, the action decoder translates this knowledge into physical manipulations, completing the "perception — reasoning — action" loop.

Primary Action Representation Formats

The choice of output format determines robot motion precision and smoothness:

• Discrete Tokens: The most prevalent approach, employed in models such as RT-2 and OpenVLA. Actions are represented as tokens, enabling the model to be trained like a standard language model predicting the next token. This is convenient because transformers train well on discrete token prediction, but it requires discretizing actions, since robots operate in continuous space.

An action token is an element from a finite set (vocabulary) encoding a component of an action.

Several discretization variants exist. The simplest is binning. For example, in RT-2 each dimension of the action vector (end-effector position, rotation, and gripper state) is quantized into 256 bins. The main drawback is the size of the joint action space: with 256 bins per dimension, the number of distinct actions grows exponentially with the number of dimensions.
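Uniform binning can be sketched as follows. This is an illustrative version, not the tokenizer of RT-2 or OpenVLA: the helper names, the per-dimension range of [-1, 1], and decoding to bin centers are assumptions made for the example.

```python
import numpy as np

NUM_BINS = 256

def action_to_tokens(action, low, high, num_bins=NUM_BINS):
    """Map each continuous action dimension to an integer bin index."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)  # scaled to [0, 1]
    return np.minimum((normalized * num_bins).astype(int), num_bins - 1)

def tokens_to_action(tokens, low, high, num_bins=NUM_BINS):
    """Map bin indices back to continuous values (bin centers)."""
    centers = (tokens + 0.5) / num_bins
    return low + centers * (high - low)

low, high = np.full(7, -1.0), np.full(7, 1.0)   # assumed per-dimension limits
action = np.array([-1.0, 1.0, 0.0, 0.3, -0.7, 0.25, 0.5])

tokens = action_to_tokens(action, low, high)
recovered = tokens_to_action(tokens, low, high)
max_error = np.abs(recovered - action).max()    # bounded by half a bin width
```

The round trip is lossy: the reconstruction error is at most half a bin width, which is the quantization error the continuous-output approaches try to avoid.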

Alternatively, factorized action tokens: The action is decomposed into sub-actions, each with its own vocabulary (position has its vocabulary, arm rotation has its own, etc.). The action decoder predicts not a single vector, but a sequence of tokens.

A more modern variant: Learned discrete tokens (VQ, codebook), where an autoencoder (VQ-VAE, VQ-GAN, tokenizer) produces a codebook used to translate continuous actions into tokens — after pre-training this autoencoder.

General scheme:

(scene embedding + language embedding) ↓ Action Decoder ↓ [token_1, token_2, …, token_T]

• Continuous Output: Used to achieve high control frequency and precision (e.g., in the π₀ model). Instead of discretization, diffusion models or flow-matching are applied, generating smooth joint trajectories directly. This method scales better for robots with high degrees of freedom.

In the diffusion case, the model starts from noise and iteratively denoises it, eventually obtaining the actual action.

Compared to discrete tokens, such models are harder to train, more difficult to debug, but offer high precision and smooth motions.

Depending on control task requirements, different neural network architectures are employed:

Diffusion Transformer: For example, in Octo and DexGrasp VLA models.
Autoregressive Transformer: Generates actions step-by-step, e.g., Gato model
MLP with next-token generation. Simpler architectures in OpenVLA and RoboMamba.
MPC (Planner). A planner is a control system component that decides what to do and in what order, rather than how to move motors right now. Some models (e.g., VoxPoser) include Model Predictive Control (MPC) blocks for decision-making in dynamic environments. This category includes VoxPoser, Mobility VLA, LMM Planner Integration, FLaRe.

Architectural Paradigms

End-to-End

This approach works well for easy, short-horizon tasks with abundant data and no need for extensive planning. However, it struggles with more complex tasks and cannot plan for long-term objectives.

Diffusion-based policies are often categorized as end-to-end, but they address a different problem. Diffusion does not add planning capabilities. Instead, it makes action inference more robust: rather than a single deterministic action, the model describes a distribution of possible actions. This reduces averaging effects and produces "softer" behavior, but the model remains fundamentally reactive.

Consequently, modern VLA systems rarely employ "pure" end-to-end architectures for everything. Typically, they serve as low-level policies excelling at local action execution, while higher-level structures are layered on top: task decomposition, high-level planning, or subgoal selection.

Robotic Transformer 2 (RT-2) — Pioneer of LVLA

Developed by Google DeepMind in 2023, RT-2 became the first large-scale model to establish the VLA paradigm in robotics.

If LLMs learn to predict the next text token, RT-2 learns to predict the next action token, conditioned on images and instructions. Essentially, the task reduces to translating instructions and visual data into token sequences. The LLM is trained to predict the next action token in autoregressive fashion.

• Architecture: A single large autoregressive transformer. Crucially, actions are discretized. The model builds upon pre-trained multimodal backbones such as PaLI-X and PaLM-E.

Visual encoder: The model employs massive Vision Transformer (ViT) variants, scaling up to 22 billion parameters.

• Training Method: RT-2 employs co-fine-tuning. The model is fine-tuned simultaneously on two distinct data streams: internet-scale visual question-answering datasets (VQA) and robot-specific motion trajectories (frames, text commands, and ground-truth action token sequences collected from robots). This approach ensures the model does not "forget" its accumulated world knowledge (e.g., what a dinosaur looks like) while learning manipulator control.

RT-2 also undergoes fine-tuning with chain-of-thought reasoning. This enables the model to first verbally plan a step (e.g., "need to find something heavy"), then output action tokens.

• Data: The foundation was the dataset from the previous RT-1 model, combined with the internet data for image and text understanding tasks on which PaLI-X and PaLM-E were trained. This comprises millions of image-text pairs describing objects, their properties, and relationships. Later versions (RT-2-X) were trained on the OXE dataset, containing over 1 million trajectories from 22 distinct robot types.

• Limitations: Challenges with long-horizon planning and precise geometry (due to discrete representation).

OpenVLA

Introduced in 2024, OpenVLA serves as an open-source counterpart to the closed RT-2, optimized for broad community adoption.

With 7 billion parameters, it is built upon the Llama 2 language model. Unlike RT-2, it fuses features from two distinct visual encoders, DINOv2 and SigLIP, enhancing spatial precision. The model processes a camera image and a text instruction, converting them into a unified token sequence; the released model does not take proprioceptive input. OpenVLA predicts 7-dimensional action vectors (end-effector position, orientation, and gripper state).

Despite its smaller size, OpenVLA outperforms RT-2-X (55B) by 16.5 percentage points in absolute success rate on a benchmark of 29 manipulation tasks, with 7× fewer parameters. It supports efficient fine-tuning methods such as LoRA, enabling deployment and customization on consumer GPUs.

Datasets: The foundation comprises Open X-Embodiment (OXE), containing over 1 million trajectories from 22 robot types, and the DROID dataset for real-world manipulation.

Hierarchical Systems

The hierarchical paradigm in VLA architecture separates robot control into two levels: high-level task planning ("System 2") and low-level action execution ("System 1").

High-Level Planner (Task Planner): Decomposes complex, long-horizon instructions (e.g., "clean the room") into sequences of simple subtasks or natural language instructions.

Low-Level Policy (Control Policy): Accepts subtasks and visual data to generate specific control commands (rotation, translation, joint forces).

SayCan

An architectural paradigm where the language model decides what to do (Say), while a second module handles low-level skills (Can) — navigation, rotation, grasping, etc. The low-level policy (BC-Z or MT-Opt) provides an affordance function estimating the probability of successfully executing that skill in the current environment state (world-grounding).

At each step, SayCan:

Prompts the LLM to propose possible next actions.
For each action, queries the corresponding skill: "How likely are you to execute this right now?"
Multiplies the language probability (Say) by the executability probability (Can).
Selects the action with the maximum joint score.
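The selection rule in the steps above can be sketched as a few lines of plain Python. The skill names and probability values are made up for illustration; in the real system the first set of numbers would come from the LLM and the second from the learned affordance functions.

```python
def saycan_select(llm_probs, affordances):
    """Combine 'Say' and 'Can' scores and pick the best skill."""
    scores = {
        skill: llm_probs[skill] * affordances.get(skill, 0.0)
        for skill in llm_probs
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical values for the instruction "wipe the table".
llm_probs = {                      # how useful the LLM thinks each skill is
    "pick up the sponge": 0.5,
    "go to the sink": 0.3,
    "pick up the apple": 0.2,
}
affordances = {                    # how feasible each skill is right now
    "pick up the sponge": 0.1,     # sponge is not within reach yet
    "go to the sink": 0.9,
    "pick up the apple": 0.6,
}

best_skill, scores = saycan_select(llm_probs, affordances)
```

In this toy case the LLM prefers picking up the sponge, but the low affordance score shifts the choice to a skill the robot can actually execute from its current state.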

Here, the LLM and skills are not trained jointly. The LLM is either frozen or fine-tuned separately. The skills themselves are trained independently — via imitation learning or reinforcement learning.

Drawbacks: The approach is considerably slower; errors in subtask execution can trigger replanning cycles, causing motion pauses; the planner may generate subtasks physically impossible for the low-level system.

Advantages: Scalability (large LLMs can be used for planning), long-horizon capability (extended planning).

HiRT

An example of the SayCan paradigm is HiRT. However, it is more neural, less "glued together from modules", and moves toward end-to-end reasoning while formally remaining hierarchical / planning-driven.

Architecture: Visual Encoder: Employs InstructBLIP, specializing in visual scene understanding within text instruction contexts. Language Encoder: LLaMA-2 serves as the primary "brain" for command processing. Action Decoder: Rather than complex diffusion systems, HiRT uses an MLP head (latent-conditioned policy head) generating commands based on latent representations from upper hierarchy levels.

Gemini Robotics 1.5

A contemporary embodiment of this paradigm is Gemini Robotics 1.5, adopting an "agentic" approach:

Orchestrator (GR-ER 1.5): Handles high-level planning and can query external tools (e.g., Google Search) for task rule comprehension.
VLA Component (GR 1.5): Executes plans using "embodied thinking", that is, generating internal reasoning chains about movements before executing them.

Modular Paradigm

The system is assembled from separate, often off-the-shelf components such as large language models (LLM) and vision-language models (VLM), connected together. Unlike end-to-end systems, modular architectures often do not require large-scale fine-tuning of the entire pipeline and allow flexible replacement of individual system parts.

Instead of passing vectors (embeddings) within a single network, modules communicate at a "human" level. For example, an LLM writes program code that calls a visual module for object detection, then passes coordinates to a controller.

VoxPoser

The system uses an LLM (e.g., GPT-4) to translate user instructions into executable code that invokes a VLM to obtain object coordinates.

Based on VLM data, two 3D maps are created: an affordance map indicating where the robot should be positioned, and a constraint map for obstacle avoidance.

The resulting maps are passed to an external model predictive control (MPC) controller, which computes an optimal real-time grasping trajectory for the robot.

VoxPoser succeeds at complex manipulation tasks in zero-shot mode (without prior training on specific robot data).

SVLR

The core idea of SVLR is dividing the control process into clear stages without requiring classical end-to-end training (training-free approach).

The model uses visual prompt retrieval to execute control strategies. Rather than learning to generate actions from scratch, the model matches recognized image segments with predefined action libraries.

The visual component is based on Mini InternVL, while language processing uses Phi-3-mini. The visual encoder (Mini InternVL) segments the image into objects. The system compares obtained segments with a library of visual prompts. Once a matching visual segment is found in the database, a script-based action binder activates a specific control strategy that maps recognized scene segments to concrete robot commands.

During operation, the robot does not receive new external prompts but "retrieves" them from memory (retrieval mechanism).

The system uses datasets consisting of self-collected visual prompts that help the model associate visual patterns with physical objects in space.

The "Atypical Action" Problem: If a human requests something not in the database (e.g., "throw the apple" instead of "place on table"), the modular system may encounter difficulties. If the skill "throw" was not pre-programmed or collected in the dataset, the robot simply will not understand what trajectory to build. However, the LLM may attempt to find the closest analog (e.g., "throw a ball") and apply that visual template to the apple segment.

Diffusion Policies

Reconceptualizes action generation as an iterative denoising process of random noise. Unlike traditional models predicting discrete tokens, this approach enables generation of smooth, continuous trajectories (modeling action distributions).

Within this paradigm, robot control is viewed not as next-step classification, but as a conditional data generation task.

The model begins with a noisy action sequence and gradually refines it. Continuous output avoids discretization (quantization) problems that in models like RT-2 may lead to reduced spatial precision or temporal resolution.

Octo

A large open-source diffusion generalist policy developed at UC Berkeley.

It uses a transformer as the central hub, to which various visual data encoders and text instruction encoders can be connected through a block-wise attention structure.

Architecture:

Input Tokenizers — all inputs are converted to tokens (observation: image, history, text instruction)
Transformer Backbone. Provides understanding of what is happening in the scene and what task needs to be performed. Importantly, the architecture is designed so that new inputs can be added (e.g., additional camera streams or proprioceptive data) without modifying the transformer core.
Diffusion Readout Head. Acts as an adapter between the transformer embeddings and the diffusion algorithm. The diffusion policy learns to recover real actions from intentionally corrupted (noised) versions using a DDPM approach (denoising diffusion probabilistic models): the head takes the transformer hidden state, projects it into action space, and outputs a noise estimate.

The model was trained on the Open X-Embodiment dataset spanning 22 different robot platforms. Instead of an autoregressive decoder, Octo employs a diffusion head that outputs continuous joint trajectories, ensuring smoother motion and rapid adaptation to new tasks.

At inference, it operates as follows:

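Start from pure noise (a random trajectory)

Condition on the scene and instruction

Run the model multiple times, removing a little noise at each step, until a clean action trajectory remains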
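The inference steps above correspond to a standard DDPM reverse loop, sketched here with NumPy. Several things are assumptions for illustration: the linear noise schedule and its 50 steps, the chunk shape of 8 steps by 7 dimensions, and above all `oracle_noise`, a toy stand-in that already knows the clean action and replaces the trained, scene-conditioned network. None of these values are taken from Octo.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """Reverse diffusion: start from Gaussian noise and denoise step by step."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)                      # pure noise
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)               # conditioning lives inside the model
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.normal(size=shape) if t > 0 else 0.0  # no noise on the final step
        x = mean + np.sqrt(betas[t]) * noise
    return x

target = np.linspace(-0.5, 0.5, 8 * 7).reshape(8, 7)  # hypothetical clean action chunk
betas = np.linspace(1e-4, 0.02, 50)                    # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def oracle_noise(x, t):
    """Toy replacement for the network: the exact noise given the known target."""
    return (x - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

actions = ddpm_sample(oracle_noise, target.shape, betas, np.random.default_rng(0))
```

With a perfect noise estimate the loop lands on the clean chunk; a trained model only approximates this, which is why several denoising iterations are run.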

Model π₀ (pi-zero)

π₀ from Physical Intelligence represents an evolution of diffusion ideas focused on extreme dexterity.

Instead of discrete diffusion steps, π₀ uses flow matching — a continuous-time model trained to predict a vector field directing noise toward clean action values. Thanks to the transition to flow-matching, π₀ significantly surpasses Octo in dynamics.
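Sampling with a flow-matching model amounts to integrating the learned vector field from noise to an action. The sketch below uses plain Euler integration in NumPy and makes several assumptions: noise sits at time 0 and the clean action at time 1, the path between them is a straight line, 10 integration steps are used, and `ideal_velocity` is a toy field that knows the target and stands in for the trained network.

```python
import numpy as np

def integrate_flow(velocity, x_start, num_steps=10):
    """Euler integration of dx/dt = velocity(x, t) from t = 0 to t = 1."""
    x = np.array(x_start, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

target = np.array([0.2, -0.1, 0.4])   # hypothetical clean action

def ideal_velocity(x, t):
    """Toy field for a straight-line path that ends at the known target."""
    return (target - x) / (1.0 - t)

noise = np.random.default_rng(0).normal(size=3)
action = integrate_flow(ideal_velocity, noise)
```

Along a straight-line path the velocity is constant and equal to the difference between the action and the noise sample, so under these assumptions that difference is the quantity a network would be trained to regress.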

The model is built on a pretrained PaliGemma backbone (combining SigLIP and Gemma), supplemented with a specialized action expert module using Mixture-of-Experts (MoE).

Since diffusion is rather slow, latency issues arise. For real-world tasks, a Real-Time Chunking (RTC) algorithm is used, which allows diffusion models to generate new action chunks in parallel with executing previous ones, eliminating pauses in robot motion.

Data

Open X-Embodiment

Contains over 1 million trajectories (episodes)
Aggregates data from 22 distinct robot types, including single-arm and dual-arm manipulators as well as mobile platforms
Covers over 500 skills and more than 160,000 unique tasks
Data collected across 311 different scenes
Provided in RLDS format and stored as sharded TFRecord files
Includes RGB images, depth maps, natural language instructions, and robot action vectors

Models trained on this data: OpenVLA, RT-X (RT-1/RT-2), Octo, pi-zero

Kaiwu

A large-scale real-world multimodal dataset introduced in 2025 for training robots in complex manipulation and human-robot interaction.

The dataset comprises 1 million multimodal episodes
Focuses on high-complexity manipulation and long-horizon planning tasks
Covers multi-embodiment operation across various robot types in real-world scenarios
Includes: RGB images, depth maps, and 3D skeletons of objects/agents
Physical signals: Haptic feedback, IMU data, and audio
Human biometric data: EMG (electromyography) signals, gaze direction, and motion capture
Natural language commands for control and interaction
Data stored in HDF5 format

RT-1-Kitchen

Approximately 130,000 episodes
Contains over 700 unique tasks formed from combinations of 12 core skills (such as "pick," "place," "open," "knock")
Includes manipulation of 16 different objects across 2 distinct office kitchen scenes
Comprises RGB images and natural language text instructions
Data collected manually via human teleoperation in a real office kitchen environment

Originally created for RT-1, subsequently used for RT-2, MOO, Q-Transformer, and RT-Trajectory.

DROID

A dataset for training robot manipulation in real, unstructured "in-the-wild" conditions

76,000 demonstrations
Data collected across 564 different scenes
Covers 86 distinct manipulation skills
Includes RGB images, depth maps, and stereo video
Each demonstration accompanied by natural language annotations
Data provided in RLDS format (same as Open X-Embodiment)

Used for training OpenVLA, FAST, V-JEPA 2

BridgeData V2

Contains 60,096 trajectories (episodes). Of these, 50,365 were collected via human teleoperation and 9,731 via scripted policies. Nearly all data was gathered using the WidowX robot manipulator
The dataset spans 24 different scenes, over 100 unique objects, and approximately 250 distinct skills
Includes RGB images, depth maps (D), and natural language instructions
Actions represented as continuous 7-DoF vectors (7 degrees of freedom), enabling description of complex spatial movements
Data serialized in TFRecords format

Dataset used for training: Octo, RT-X, Edge VLA, and ECoT

CALVIN and LIBERO

CALVIN and LIBERO are specialized simulation datasets and benchmarks.

CALVIN is built on the PyBullet physics engine using the Franka robot model. Contains over 5,000 demonstrations including synchronized RGB-D frames, force data, proprioception, and text instructions. Serves as a standard benchmark for models including GR-1, HULC, RoboFlamingo, RoboUniView, UP-VLA, and OE-VLA.

LIBERO was developed as a benchmark for investigating continual learning. The dataset includes 130 unique tasks. Data stored in JSON and Parquet formats, including images, action trajectories, and structured metadata. LIBERO demonstrates capabilities of modern architectures such as MDT, OTTER, VLA-Cache.

Simulators

In VLA, a simulator is a controlled "pseudo-reality" where an agent learns to act: receiving visual observations, executing actions, obtaining rewards and failures without a real robot.

Simulators are almost never written "from scratch." In the vast majority of cases, simulator development is assembly from ready-made bricks where physics, rendering, and sensors already exist, and the researcher only adds the necessary logic and training interface.

At the core of nearly any modern simulator lies a physics engine. This is a large system that calculates body motion, collisions, forces, friction, inertia, and joints.

On top of physics, scene geometry is added. These are simply object meshes: table, cup, box, robot, floor. Each object is described primitively: shape (mesh or simple primitives), mass, center of mass, friction coefficients, elasticity.

The next layer is the robot model. The robot is described via URDF or equivalent: which joints, what constraints, what links, where motors. This description is then loaded into the engine, and physics itself calculates kinematics and dynamics.

After this, environment logic appears. These are very simple things: when an episode begins and ends, what counts as success, which instructions are permissible.

Typically, a simulator contains a scene (room, table, robot, objects), physics (gravity, collisions, friction), cameras and sensors. The agent receives observations such as RGB or RGB-D images, robot state (joint angles, velocity), sometimes scene description or instruction. In response, it outputs an action: move arm, rotate gripper, open gripper, etc. The simulator computes physics and returns the next state. Step by step, an episode unfolds.
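The observation, action, next-state loop can be sketched with a deliberately tiny environment. Everything here is hypothetical: `ToyReachEnv` is a one-dimensional stand-in with no physics or rendering, and the `reset`/`step` interface merely imitates the shape commonly used by RL environments.

```python
class ToyReachEnv:
    """Hypothetical 1-D environment: move a gripper along a line to a target cell."""

    def __init__(self, target=5, max_steps=20):
        self.target = target
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def _observe(self):
        return {
            "position": self.position,
            "target": self.target,
            "instruction": f"move to cell {self.target}",
        }

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position, self.steps = 0, 0
        return self._observe()

    def step(self, action):
        """Apply an action, advance the state, and report reward and termination."""
        self.position += max(-1, min(1, int(action)))  # clamp to one cell per step
        self.steps += 1
        success = self.position == self.target
        done = success or self.steps >= self.max_steps
        return self._observe(), (1.0 if success else 0.0), done


def run_episode(env, policy):
    """Roll out one episode and return the total reward and number of steps."""
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(policy(obs))
        total_reward += reward
    return total_reward, env.steps


def greedy_policy(obs):
    """Trivial policy: always move toward the target."""
    return 1 if obs["position"] < obs["target"] else -1


total_reward, num_steps = run_episode(ToyReachEnv(), greedy_policy)
```

A real simulator replaces the position update with a physics step and the dictionary with rendered images and joint states, but the episode loop keeps the same shape.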

There is another mode where the policy is trained directly inside the simulator through RL or its hybrids. Then the model acts, receives reward (task success/failure), and gradually improves. Often the approach is: first extensive imitation on simulation demonstrations, then some RL fine-tuning, then all distilled into a single VLA model.

Popular Platforms:

AI2-THOR. RGB, depth, segmentation, object states. Navigation and manipulation in photorealistic home interiors (base for ALFRED).

Habitat. RGB, depth, segmentation, agent poses. High-performance navigation in large-scale 3D scenes.

NVIDIA Isaac Sim. LiDAR, point clouds, forces/torques, RTX rendering. Industrial robotics, RL, digital twins.

MuJoCo. Contact forces, kinematics, RGB. High-speed simulation for precise control and contact dynamics.

SAPIEN. Segmentation masks, articulated objects. Manipulation with deformable and complex objects, dexterous grasping.

PyBullet. RGB, depth, contact forces. Rapid prototyping and long-horizon tasks (base for CALVIN).

Kinetix. Force control. Dynamic tasks (throwing, catching, balancing).

The main limitation of simulation preventing models from being trained solely on simulations is the sim-to-real gap. Models trained only in virtual environments often perform poorly on physical robots due to significant visual differences and physics inaccuracies.

Training

VLA models are almost never trained "from scratch" in the sense of everything at once. There are almost always pretrained blocks. The visual encoder is ViT, CLIP, DINO, SigLIP, or something similar. The language encoder is an LLM or its distilled version. These are not trained from scratch.

Training typically occurs in two stages: pretraining, where the model acquires basic world understanding; and finetuning (alignment) for the target domain. Sometimes RL or planning-guided finetuning is added.

In many cases, full parameter updates are unnecessary. Instead, methods are used that allow easy adaptation of the model to new tasks while preserving most knowledge acquired during pretraining. This approach minimizes catastrophic forgetting risk and makes finetuning more economical. One common method in this class is Low-Rank Adaptation (LoRA).

Instead of updating all transformer parameters, LoRA adds two small low-rank matrix adapters to each layer. The main weights W remain frozen, while only these adapters are trained.
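A minimal forward-pass sketch of a LoRA-adapted linear layer in NumPy. The class name `LoRALinear`, the rank of 8, the scaling factor, and the initialization constants are illustrative assumptions; training code and library integration are omitted.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a low-rank update: W_eff = W + (alpha / r) * B A."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                           # frozen pretrained weights
        self.a = rng.normal(0.0, 0.01, (rank, d_in))   # trainable, small random init
        self.b = np.zeros((d_out, rank))               # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.a.T) @ self.b.T             # never forms the full d_out x d_in matrix
        return base + self.scale * update

    def trainable_params(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))
layer = LoRALinear(w, rank=8, alpha=16, rng=rng)
x = rng.normal(size=(4, 512))
y = layer(x)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and in this example only 8,192 values are trainable instead of the 262,144 in W.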

There are enhanced versions of LoRA, such as QLoRA, which combines LoRA with 4-bit quantization of the frozen base model. The adapters are trained on top of the compressed weights, reducing memory requirements with little performance loss.

An alternative training approach is to freeze all VLM parameters and update only the Action Decoder. In this configuration, the VLM serves as a feature extractor whose visual and textual features the decoder then uses for action prediction (RT-2, Lingo-2).

Batch size is usually small because images, video, diffusion, and context are quite memory-intensive.

Loss

Loss is always determined by how the action is represented.

For a continuous action space with a diffusion policy (Octo, pi-zero, part of Gemini Robotics), the base loss has the same form as in image diffusion models, except that the noised object is an action vector rather than an image. The true action from the demonstration is taken, Gaussian noise is added according to diffusion step t, and the model must predict that noise. The loss is the standard MSE between predicted and actual noise.

If diffusion predicts not one step but a chunk of actions (e.g., trajectory for 8–16 steps), then MSE is simply summed over time and across all action coordinates.
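The noise-prediction objective for an action chunk can be sketched as follows. This is a simplified illustration under stated assumptions: `predict_noise` is a hypothetical stand-in for the policy network (a real one would also be conditioned on the observation, the instruction, and the step t), `alpha_bar` stands for the cumulative noise-schedule coefficient at step t, and the loss is averaged rather than summed, which differs only by a constant factor.

```python
import math
import random

def add_noise(action, alpha_bar, rng):
    """Forward diffusion: a_t = sqrt(alpha_bar) * a_0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
             for a, e in zip(action, eps)]
    return noisy, eps

def noise_mse(pred_eps, true_eps):
    """Mean squared error between predicted and actual noise."""
    return sum((p - t) ** 2 for p, t in zip(pred_eps, true_eps)) / len(true_eps)

def chunk_loss(chunk, alpha_bar, predict_noise, rng):
    """Loss for a chunk of H action vectors.

    The chunk is flattened, so the error is aggregated over both
    time steps and action coordinates.
    """
    flat = [x for step in chunk for x in step]
    noisy, eps = add_noise(flat, alpha_bar, rng)
    return noise_mse(predict_noise(noisy, alpha_bar), eps)

# Toy usage: a chunk of 3 steps with 2 action dimensions each,
# and a dummy "model" that always predicts zero noise.
rng = random.Random(0)
chunk = [[0.1, -0.2], [0.3, 0.0], [0.5, 0.4]]
loss = chunk_loss(chunk, 0.5, lambda noisy, ab: [0.0] * len(noisy), rng)
print(loss)
```

A model that recovers the injected noise exactly drives this loss to zero; the zero-predicting dummy above leaves it at the mean squared magnitude of the sampled noise.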

If actions are discrete or quantized (as in Gato, OpenVLA, and some RT-like models), the loss is standard cross-entropy. Each action is represented as a token or token sequence, and the model learns to predict the next token, as a language model does.
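A rough sketch of the discrete case: each continuous action coordinate is mapped to a bin index, which plays the role of a token, and the loss is the cross-entropy of that index under the model's logits. The range [-1, 1] and the 256 bins are assumptions chosen for illustration, and `discretize` and `cross_entropy` are hypothetical helper names.

```python
import math

def discretize(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # the upper edge falls into the last bin

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# Toy usage: one action coordinate, uniform logits over all bins.
target_token = discretize(0.0)          # 128
uniform_logits = [0.0] * 256
print(target_token, cross_entropy(uniform_logits, target_token))
```

With uniform logits the loss equals log(256), the value for a model that has learned nothing; training pushes probability mass toward the bin of the demonstrated action.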

Besides the main loss, auxiliary losses are often added. The most common is vision-language alignment, for example a CLIP-like contrastive loss that pulls scene images and instruction text into the same embedding space.
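A CLIP-like contrastive loss can be sketched in a few lines. The version below is a symmetric InfoNCE loss over a batch of paired image and text embeddings; the function name `clip_style_loss` and the temperature of 0.07 are illustrative assumptions, and real implementations operate on batched tensors with a learned temperature.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def log_softmax(row):
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k])."""
    img = [normalize(v) for v in image_embs]
    txt = [normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    n = len(logits)
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same along the columns.
    cols = [list(c) for c in zip(*logits)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Matching pairs yield a loss near zero, while swapping the texts makes every pair a mismatch and the loss large.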

Sometimes state prediction loss is added: the model must predict the next robot state or scene changes. This works as regularization, helping the model better understand world dynamics.

Models with planning or a hierarchy sometimes add a loss on high-level intent: a latent goal or subtask is predicted first, then the action, and the latent prediction gets its own loss, often MSE or cross-entropy as well.

Metrics

The most basic and important metric is success rate. A set of tasks or instructions is taken, the model is run in simulation or on a real robot, and the fraction of episodes in which the task is completed is counted. Example tasks: "put the mug on the table," "open the drawer," "move the object." If 63 out of 100 attempts succeed, the success rate is 63%.

Success rate is binary, however, and therefore coarse, so partial-success or task-specific metrics are often calculated as well: distance to goal, how well the object is oriented, whether it was held for N steps, how smooth the trajectory was.

A very important metric for VLA is generalization. The model is tested on new objects, new scenes, new instruction formulations, and combinations not seen during training, and the drop in success rate relative to seen conditions is measured. If the model "only knows what it saw," it is useless as a generalist policy.
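As a small worked example, success rate and the generalization drop can be computed from per-episode outcomes. The 63-out-of-100 figure comes from the text above; the 40-out-of-100 figure for unseen conditions is invented purely for illustration.

```python
def success_rate(outcomes):
    """Fraction of episodes marked successful (True)."""
    return sum(outcomes) / len(outcomes)

def generalization_drop(seen_outcomes, unseen_outcomes):
    """Absolute drop in success rate from seen to unseen conditions."""
    return success_rate(seen_outcomes) - success_rate(unseen_outcomes)

seen = [True] * 63 + [False] * 37      # 63 of 100 episodes succeed
unseen = [True] * 40 + [False] * 60    # hypothetical result on new objects

print(f"seen: {success_rate(seen):.0%}")
print(f"unseen: {success_rate(unseen):.0%}")
print(f"drop: {generalization_drop(seen, unseen):.0%}")
```

Reporting the drop alongside the raw success rate makes it clear how much of the performance depends on having seen the exact objects, scenes, or phrasings during training.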

On real robots, another category of metrics appears: safety and stability, that is, how many times the robot dropped an object, got stuck, or performed a physically dangerous movement.

There are also language alignment metrics. For example, the model may perform an action that completes a task but is not the one requested: formally there is success, but the instruction was violated. Therefore, human evaluation is sometimes added, in which a person watches the video and judges whether the behavior matches the instruction, though this is quite expensive.

Throughput: A combined metric of speed and performance (number of completed subtasks per minute). It is critically important for real-time systems.

Cumulative Progress: Assessment of what portion of a complex multi-step task the robot managed to complete before time expired.
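Both of these metrics reduce to simple ratios. The sketch below assumes an episode is logged as a count of completed subtasks, a total subtask count, and an elapsed time in seconds; the function names are hypothetical.

```python
def throughput(completed_subtasks, elapsed_seconds):
    """Completed subtasks per minute."""
    return completed_subtasks / (elapsed_seconds / 60.0)

def cumulative_progress(completed_subtasks, total_subtasks):
    """Fraction of a multi-step task finished before time expired."""
    return completed_subtasks / total_subtasks

# Example: 3 of 5 subtasks completed within a 120-second time limit.
print(throughput(3, 120))          # 1.5 subtasks per minute
print(cumulative_progress(3, 5))   # 0.6
```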

Resource Requirements (GPU)

The number of required GPUs depends heavily on model size and training stage:

Giant Models (Foundation Models): Training models like RT-2 (up to 55 billion parameters) requires enormous computational power and time, available only to major laboratories (Google DeepMind).

OpenVLA (7B): Can be finetuned on consumer GPUs (e.g., RTX 3090/4090 class, approximately 24 GB VRAM) thanks to LoRA and quantization.

SmolVLA (450M): Specifically designed to "democratize" research; it trains on small datasets and can run even on CPU or a single consumer GPU (<24 GB VRAM).
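A back-of-the-envelope calculation shows why quantization matters for the 7B case. The helper below counts only the memory needed to store the weights; activations, optimizer state, and adapter parameters come on top and are deliberately left out, so treat the result as a lower bound rather than a measured requirement.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7_000_000_000  # a 7B-parameter model such as OpenVLA

print(weight_memory_gb(n_params, 16))  # 14.0 GB in 16-bit precision
print(weight_memory_gb(n_params, 4))   # 3.5 GB in 4-bit precision
```

In 16-bit precision the weights alone take more than half of a 24 GB card, leaving little room for training; at 4 bits they shrink to a quarter of that, which is what makes room for activations and the small set of trainable adapters.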

Optimization

A VLA combines a visual encoder, a language model (LLM), and an Action Decoder, with total parameter counts ranging from hundreds of millions (SmolVLA) to tens of billions (RT-2). Model inference can therefore take considerable time, during which the environment and robot state may change.

This is often addressed through architectural improvements (e.g., Mixture of Experts) and computation optimization on the target device via quantization and GPU-efficient inference.

Modern research focuses on solving latency and accuracy problems:

Action Chunking: The decoder predicts not one action but a sequence (chunk) of future steps. This provides temporal stability and allows the system to "think" about the future while executing the current movement.
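The control loop behind action chunking can be sketched as follows. This is a schematic under stated assumptions: `policy` and `env_step` are hypothetical callables, and executing only the first few actions of each chunk before replanning is one common design choice, not the only one.

```python
def run_with_chunking(policy, env_step, obs, n_steps, execute_per_chunk):
    """Predict a chunk, execute its first actions, then replan.

    policy(obs) -> list of future actions (the chunk)
    env_step(obs, action) -> next observation
    Returns the executed actions and the number of policy calls.
    """
    if execute_per_chunk < 1:
        raise ValueError("execute_per_chunk must be at least 1")
    executed = []
    policy_calls = 0
    while len(executed) < n_steps:
        chunk = policy(obs)
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        policy_calls += 1
        for action in chunk[:execute_per_chunk]:
            if len(executed) == n_steps:
                break
            obs = env_step(obs, action)
            executed.append(action)
    return executed, policy_calls

# Toy environment: the state is an integer and each action sets the next state.
toy_policy = lambda obs: [obs + i for i in range(1, 9)]   # chunk of 8 actions
toy_step = lambda obs, action: action

actions, calls = run_with_chunking(toy_policy, toy_step, 0, 16, 4)
print(calls)  # 4 policy calls for 16 executed actions
```

Here the expensive model runs once per four control steps instead of once per step, which is how chunking hides inference latency. Executing fewer actions per chunk makes the robot more reactive at the cost of more model calls.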

Applications

Application Areas of VLA Models

In 2025, VLA-equipped robots are being piloted in automotive factories (e.g., Tesla Optimus in battery assembly shops or Apollo at Mercedes-Benz plants).

Models like Helix enable two humanoid robots to work together on one long task, coordinating arm and body movements in real time.

One of the most important characteristics of VLA models (especially RT-2) is emergent skills that are not present in the robot's training data. For example, the robot can use semantic reasoning to choose a rock as an "improvised hammer" or offer an energy drink to a "tired person," drawing on knowledge acquired from the internet.

Contents