🚀SkillOpt: Integrating Skills into Agents

Introduction

In the last two years, AI agents have developed rapidly. Modern agents can search the web, analyze documents, write code, work with spreadsheets, and even interact with software environments. However, there is a problem.

Most agents are still heavily dependent on the knowledge already stored inside their model weights. When we want an agent to perform better on a specific task, we usually have three choices:

Write instructions manually.
Generate instructions using another LLM.
Fine-tune the model.

Each approach has limitations.

Manual instructions take time and require expertise. One-shot generated instructions often become outdated or fail in new scenarios. Fine-tuning is expensive, especially when working with closed-source frontier models.

What is SkillOpt?

The SkillOpt paper introduces a different idea: instead of training the model, we train the skill.

It treats a skill as a separate document containing procedures, rules, tool-usage guidelines, formatting instructions, and domain-specific knowledge. Rather than changing the model weights, SkillOpt continuously improves this skill document using feedback from the agent's performance.

The result is a system that behaves much like gradient descent in deep learning, but instead of optimizing numerical parameters, it optimizes the text.

This simple idea turns agent skills into trainable assets.

Challenges with Existing Approaches

To understand SkillOpt, we first need to understand how the existing methods fail.

1. Hand-Crafted Skills

When working with agents, we often write task-specific instructions by hand. These are usually called a system prompt or, more recently, a skill.

For example:

Always verify spreadsheet formulas.
Search multiple sources before answering.
Use specific formatting for outputs.

These instructions can help initially, but they do not improve automatically when the agent makes mistakes. Every improvement requires manual intervention by a human expert.

2. One-Shot Skill Generation

Sometimes even the task of skill generation is delegated to an LLM: we ask a powerful model to generate a skill document. The problem is that the skill is written before any real failures have been observed.

If the agent repeatedly makes mistakes, the skill remains unchanged.

3. Prompt Optimization Methods

A lesser-known family of methods, including TextGrad and GEPA, optimizes prompts using feedback.

While useful, they mainly focus on prompt evolution rather than maintaining a reusable skill artifact. The resulting improvements are often tied to a particular setup.

4. Skill Evolution Systems

Other methods, such as Trace2Skill and EvoSkill, attempt to learn skills from execution traces. However, they lack some of the controls commonly used in machine learning optimization:

Learning rates
Validation checks
Controlled updates
Rejected-step memory

As a result, large skill changes can sometimes make the performance worse.

If skills are the main adaptation layer for agents, then skills must be trained with the same discipline as neural network optimization. This is the core idea behind SkillOpt.

SkillOpt: The Core Idea

The key insight behind SkillOpt is surprisingly simple: Treat the skill document as the trainable state of the agent.

Here, the target model remains frozen, and a separate optimizer model analyzes the agent's trajectories and proposes modifications to the skill document.

The agent-skill optimization loop looks like this:

Run the agent with the current skill.
Collect successful and failed trajectories.
Analyze the trajectories.
Propose edits to the skill.
Validate the updated skill.
Accept the skill only if performance improves.

This creates a feedback loop similar to training neural networks.
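
The six steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (optimize_skill, run_agent, propose_edit, validate), the pass threshold, and the toy stand-ins at the bottom are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical types and names for illustration; none come from the paper.
Rollout = Tuple[str, float]  # (trajectory text, evaluation score)


def optimize_skill(
    skill: str,
    run_agent: Callable[[str], List[Rollout]],
    propose_edit: Callable[[str, List[Rollout], List[Rollout]], str],
    validate: Callable[[str], float],
    steps: int = 3,
    pass_threshold: float = 0.5,
) -> str:
    """Propose, validate, and accept skill updates for a frozen agent."""
    best_score = validate(skill)
    for _ in range(steps):
        rollouts = run_agent(skill)  # run the agent with the current skill
        # Collect successful and failed trajectories.
        successes = [r for r in rollouts if r[1] >= pass_threshold]
        failures = [r for r in rollouts if r[1] < pass_threshold]
        # Analyze the trajectories and propose an edited skill.
        candidate = propose_edit(skill, successes, failures)
        # Validate, and accept only if performance improves.
        score = validate(candidate)
        if score > best_score:
            skill, best_score = candidate, score
    return skill


# Toy stand-ins so the sketch runs without any model calls.
def toy_run_agent(skill: str) -> List[Rollout]:
    return [("task-1", 1.0 if "verify formulas" in skill else 0.0)]


def toy_propose_edit(skill: str, successes: List[Rollout], failures: List[Rollout]) -> str:
    return skill + "\nverify formulas" if failures else skill


def toy_validate(skill: str) -> float:
    return 1.0 if "verify formulas" in skill else 0.0


final_skill = optimize_skill("be concise", toy_run_agent, toy_propose_edit, toy_validate)
print(final_skill)
```

In a real setup, run_agent and propose_edit would call the frozen target model and the optimizer model respectively, and validate would score the candidate on held-out tasks.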

Deep Dive into SkillOpt

Step 1: Collect Rollout Evidence

First, the agent executes the tasks using the current skill. During execution, SkillOpt collects:

User interactions
Tool calls
Intermediate reasoning
Final outputs
Evaluation scores

These collected trajectories become the training data for skill improvement. You can think of this as collecting training examples in reinforcement learning.
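
One way to picture the collected evidence is as a record per task. The sketch below is an assumption about how such a record could be laid out; the class names, field names, and the example tool name write_cell are hypothetical and not taken from the paper.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """One tool invocation made by the agent (hypothetical layout)."""
    name: str
    arguments: Dict[str, Any]
    result: str


@dataclass
class Trajectory:
    """Everything recorded for a single task rollout (hypothetical layout)."""
    task_id: str
    user_messages: List[str]
    tool_calls: List[ToolCall]
    reasoning_steps: List[str]
    final_output: str
    score: float  # evaluation score assigned after the run


example = Trajectory(
    task_id="sheet-001",
    user_messages=["Fill column C with the sum of columns A and B."],
    tool_calls=[ToolCall("write_cell", {"cell": "C1", "value": "=A1+B1"}, "ok")],
    reasoning_steps=["Write a formula and assume it recalculates."],
    final_output="Wrote formulas to column C.",
    score=0.0,
)

# Serialize the record so it can be stored and later shown to an optimizer.
serialized = json.dumps(asdict(example))
print(serialized)
```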

Step 2: Reflection and Error Analysis

SkillOpt separates successful and failed trajectories, and the optimizer model analyzes both groups.

Failures reveal missing procedures.
Successes reveal behaviors worth preserving.

Instead of focusing on individual examples, SkillOpt processes mini-batches of trajectories to discover recurring patterns.

For example, suppose an agent repeatedly fails spreadsheet tasks because it assumes formulas will automatically recalculate. The optimizer may learn a new rule:

Write evaluated static values rather than relying on recalculation.

This rule becomes part of the skill document.
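
The splitting and mini-batching described above could look roughly like this. In SkillOpt the pattern discovery is done by an optimizer model; here a simple counter over a hypothetical error_tag field stands in for it, and the threshold, batch size, and function names are all assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterator, List, Tuple

Record = Dict[str, Any]  # hypothetical trajectory record


def split_by_outcome(
    trajectories: List[Record], threshold: float = 0.5
) -> Tuple[List[Record], List[Record]]:
    """Separate trajectories into successes and failures by score."""
    successes = [t for t in trajectories if t["score"] >= threshold]
    failures = [t for t in trajectories if t["score"] < threshold]
    return successes, failures


def mini_batches(items: List[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield consecutive chunks of at most batch_size records."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def recurring_errors(batch: List[Record], min_count: int = 2) -> List[str]:
    """Return error tags seen at least min_count times in the batch."""
    counts = Counter(t["error_tag"] for t in batch if t.get("error_tag"))
    return [tag for tag, n in counts.most_common() if n >= min_count]


trajectories = [
    {"task": "sheet-1", "score": 0.0, "error_tag": "stale_formula"},
    {"task": "sheet-2", "score": 1.0, "error_tag": None},
    {"task": "sheet-3", "score": 0.0, "error_tag": "stale_formula"},
    {"task": "sheet-4", "score": 0.0, "error_tag": "wrong_sheet"},
]

successes, failures = split_by_outcome(trajectories)
patterns = [recurring_errors(batch) for batch in mini_batches(failures, batch_size=3)]
print(patterns)
```

A tag that shows up only once in a batch is ignored, which mirrors the idea of reacting to recurring patterns rather than to individual examples.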

Step 3: Controlled Skill Editing

Most previous approaches rewrite large portions of the prompt. SkillOpt takes a more controlled approach.

SkillOpt performs structured edits like ADD, DELETE, and REPLACE. The system introduces a concept called a textual learning rate, where instead of allowing unlimited modifications, only a limited number of edits are permitted during each update.

This is similar to how gradient descent uses a learning rate to avoid unstable jumps. Benefits include:

Stable optimization
Reduced overfitting
Easier debugging
Better preservation of useful rules
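
A bounded edit step can be sketched as follows. This assumes the skill is stored as a list of rule strings and that the cap simply keeps the first max_edits proposals; both are simplifying assumptions, and the Edit class and apply_edits function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    """A structured edit to a skill (hypothetical representation)."""
    op: str                        # "ADD", "DELETE", or "REPLACE"
    target: Optional[str] = None   # existing rule to delete or replace
    text: Optional[str] = None     # new rule text for ADD or REPLACE


def apply_edits(rules: List[str], edits: List[Edit], max_edits: int) -> List[str]:
    """Apply at most max_edits edits; the cap plays the role of a textual learning rate."""
    updated = list(rules)  # leave the caller's list untouched
    for edit in edits[:max_edits]:
        if edit.op == "ADD" and edit.text:
            updated.append(edit.text)
        elif edit.op == "DELETE" and edit.target in updated:
            updated.remove(edit.target)
        elif edit.op == "REPLACE" and edit.target in updated and edit.text:
            updated[updated.index(edit.target)] = edit.text
    return updated


rules = [
    "Always verify spreadsheet formulas.",
    "Use specific formatting for outputs.",
]
proposed = [
    Edit("ADD", text="Write evaluated static values rather than relying on recalculation."),
    Edit("REPLACE", target="Use specific formatting for outputs.",
         text="Match the output format requested by the task."),
    Edit("DELETE", target="Always verify spreadsheet formulas."),
]

# With max_edits=2, the third proposal (the DELETE) is not applied.
new_rules = apply_edits(rules, proposed, max_edits=2)
print(new_rules)
```

Because each update touches only a few rules, it stays easy to see which edit caused a change in performance.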

Step 4: Validation Gate

Every proposed skill is evaluated on a separate validation set. If the updated skill performs better, it is accepted; otherwise, it is rejected.

This prevents harmful modifications from entering the final skill. Without this step, an optimizer might continuously introduce changes that appear useful but actually reduce performance.
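
A validation gate of this kind can be sketched in a few lines. The function name, the use of a mean score, and the strict "greater than" acceptance rule are assumptions for illustration; the toy scorer just checks whether a keyword appears in the skill text.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def validation_gate(
    current_skill: str,
    candidate_skill: str,
    validation_tasks: Sequence[str],
    score_task: Callable[[str, str], float],
) -> Tuple[str, bool]:
    """Return the skill to keep and whether the candidate was accepted."""
    current = mean([score_task(current_skill, t) for t in validation_tasks])
    candidate = mean([score_task(candidate_skill, t) for t in validation_tasks])
    accepted = candidate > current  # accept only on strict improvement
    return (candidate_skill if accepted else current_skill), accepted


# Toy scorer (hypothetical): a task passes if its keyword is in the skill.
def toy_score(skill: str, task: str) -> float:
    return 1.0 if task in skill else 0.0


tasks = ["formulas", "formatting"]
current = "Check formulas."

kept_good, accepted_good = validation_gate(
    current, "Check formulas and formatting.", tasks, toy_score
)
kept_bad, accepted_bad = validation_gate(current, "Be brief.", tasks, toy_score)
print(accepted_good, accepted_bad)
```

The gate is only as reliable as score_task, so a noisy or biased scorer would let harmful edits through or block useful ones.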

Step 5: Rejected Edit Buffer

Rejected edits are not discarded. Instead, SkillOpt stores them in a rejected-edit buffer.

Future optimization steps can inspect this history and avoid repeating the same mistakes. This creates a form of negative feedback memory.

The optimizer learns not only from success but also from failure.
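
A rejected-edit buffer could be as simple as a bounded history that is checked before re-proposing an edit. The class below is a hypothetical sketch: the capacity, the whitespace and case normalization, and the method names are assumptions, not details from the paper.

```python
from collections import deque
from typing import Deque, Tuple


class RejectedEditBuffer:
    """Bounded memory of edits that failed validation (hypothetical design)."""

    def __init__(self, capacity: int = 50) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._entries: Deque[Tuple[str, str]] = deque(maxlen=capacity)

    @staticmethod
    def _normalize(edit: str) -> str:
        # Ignore case and extra whitespace when comparing edits.
        return " ".join(edit.lower().split())

    def record(self, edit: str, reason: str) -> None:
        self._entries.append((self._normalize(edit), reason))

    def seen(self, edit: str) -> bool:
        key = self._normalize(edit)
        return any(stored == key for stored, _ in self._entries)

    def as_context(self) -> str:
        """Render the history as text that could be shown to an optimizer."""
        return "\n".join(
            f"- {edit} (rejected: {reason})" for edit, reason in self._entries
        )


buffer = RejectedEditBuffer(capacity=2)
buffer.record("ADD: Always re-open the file twice.", "validation score dropped")

proposals = [
    "add:  always re-open the file twice.",
    "ADD: Write evaluated static values.",
]
fresh = [p for p in proposals if not buffer.seen(p)]
print(fresh)
```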

Step 6: Slow and Meta Updates

SkillOpt introduces a second learning mechanism called the slow/meta update. At the end of each epoch, SkillOpt compares:

Previous skill versions
Current skill versions

From this comparison, it identifies improvements, regressions, persistent failures, and stable successes.

These long-term observations are converted into high-level guidance for future optimization. You can think of this as a momentum mechanism for skill learning.
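
The comparison between skill versions can be illustrated with per-task scores. This sketch assumes each version has a score per task and that a fixed pass threshold decides success; the function name compare_versions and the report keys are hypothetical. Turning the report into high-level guidance would be the optimizer model's job and is not shown.

```python
from typing import Dict, List


def compare_versions(
    previous_scores: Dict[str, float],
    current_scores: Dict[str, float],
    pass_threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Bucket tasks by how their outcome changed between two skill versions."""
    report: Dict[str, List[str]] = {
        "improvements": [],
        "regressions": [],
        "persistent_failures": [],
        "stable_successes": [],
    }
    # Only tasks scored under both versions can be compared.
    for task in sorted(previous_scores.keys() & current_scores.keys()):
        before = previous_scores[task] >= pass_threshold
        after = current_scores[task] >= pass_threshold
        if not before and after:
            report["improvements"].append(task)
        elif before and not after:
            report["regressions"].append(task)
        elif not before and not after:
            report["persistent_failures"].append(task)
        else:
            report["stable_successes"].append(task)
    return report


previous = {"a": 0.0, "b": 1.0, "c": 0.0, "d": 1.0}
current = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 1.0}
report = compare_versions(previous, current)
print(report)
```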

Results and Evaluation

The evaluation was extensive, covering multiple models and benchmark datasets.

SkillOpt was tested across:

6 benchmarks
7 target models
3 execution environments

Main Result: SkillOpt achieved the best or tied-best performance in 52 out of 52 evaluation cells. That is an unusually strong result.

Cross-Model Transfer

Another impressive finding is transferability. A skill optimized for one model often improves another model. For example, a SpreadsheetBench skill trained on GPT-5.4 improved GPT-5.4-mini and GPT-5.4-nano without additional optimization.

This suggests the learned skills capture reusable procedures rather than model-specific tricks.

Cross-Harness Transfer

SkillOpt was also tested by transferring skills between:

Codex environments
Claude Code environments

Some transferred skills produced gains exceeding 50 points over the baseline. This is important because it means organizations can optimize a skill once and reuse it across multiple agent platforms.

Why SkillOpt Works

SkillOpt works for four key reasons:

Bounded Updates: Small edits prevent catastrophic changes.

Validation Gate: Only beneficial skills survive.

Rejected Edit Memory: Past failures become learning signals.

Slow Meta Updates: Long-term patterns are preserved.

Removing these components causes noticeable performance drops, especially on complex procedural tasks.

Pros of SkillOpt

No Model Fine-Tuning Required: It works with closed-source models.
Transferable Skills: Skills can be moved across models and environments.
Human Readable: The final skills remain compact and interpretable. Most learned skills contain only a few hundred to a few thousand tokens.
Stable Optimization: Learning-rate controls and validation checks reduce instability.
Zero Deployment Cost: The optimizer runs only during training. At inference time, only the optimized skill is used.

Cons of SkillOpt

Requires Large Training Runs: Generating trajectories and evaluations can be expensive.
Needs Strong Evaluators: The validation gate is only as good as the evaluation metric.
Additional Optimization Infrastructure: Organizations must build rollout collection, reflection pipelines, validation systems, and skill management workflows.

Not a Replacement for Fine-Tuning: Some knowledge deficiencies still require model updates. Skill optimization mainly improves procedures rather than core knowledge.

Final Thoughts

SkillOpt introduces an important shift in how we think about agent improvement.

For years, the default assumption was that improving agents meant improving model weights. SkillOpt shows that another path exists: improve the skill instead.

By treating skills as trainable artifacts and optimizing them with concepts borrowed from deep learning, like learning rates, validation gates, feedback loops, and controlled updates, the system achieves remarkable gains without touching the underlying model.

The results are impressive: best-or-tied performance on all 52 evaluated settings, significant improvements across multiple benchmarks, and successful transfer across models and execution environments.

The biggest takeaway is not the benchmark scores. It is the idea that agent skills can become a new optimization layer between prompts and model weights.

As AI agents become more common in production systems, methods like SkillOpt may become one of the most practical ways to continuously improve agent behavior while keeping deployment simple, interpretable, and cost-effective.
