A few days ago I came across a very interesting GitHub project called 'Self Improving Coding Agent' (SICA), and I decided to study its repository.

The basic idea behind the project is to have an agent that is not only capable of accomplishing specific tasks, but is also capable of improving its ability to accomplish them by modifying its own source code. Essentially, the agent can rewrite itself, and it can also create tools that make it better at rewriting itself, which the author calls meta-improvement.

To give you a general idea of what the agent was able to accomplish, here are a few tools it developed on its own:

  1. The agent started with a basic file-overwriting approach for code changes
  2. It developed a Smart Editor capable of more intelligent contextual edits
  3. It developed a Diff-Enhanced Smart Editor, to incorporate targeted modifications and pattern-based editing
  4. It then wrote a Quick Overwrite Tool to reduce processing demands
  5. It then implemented a "Minimal Diff Output Optimization" and "Context-Sensitive Diff Minimization" using Abstract Syntax Tree (AST) parsing for efficiency (see the sketch after this list).
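
To give a flavour of that last step, here is a minimal sketch of the underlying idea, using Python's ast module to decide whether an edit changes program structure at all. The function below is my own illustration, not code from the repository:

```python
import ast

def structurally_equal(old_source: str, new_source: str) -> bool:
    """Return True if two snippets parse to the same AST, ignoring
    formatting and comments (a crude 'minimal diff' filter)."""
    try:
        old_tree = ast.parse(old_source)
        new_tree = ast.parse(new_source)
    except SyntaxError:
        return False
    # ast.dump normalises away whitespace and comments, so equal dumps
    # mean the edit is purely cosmetic and the diff can be skipped.
    return ast.dump(old_tree) == ast.dump(new_tree)

print(structurally_equal("x = 1+2", "x = 1 + 2"))  # True: no structural change
```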

Several other changes were also conceived and implemented by SICA. The key point is that it was not only able to implement the changes, but also to come up with them in the first place.

I decided to study the project and take a few shots at my own implementation of it.

The architecture

First and most important thing about this project:

If you ever want to run a project like this, you should run it inside a Docker container. This agent has the ability to create, delete and modify files, and it is a terrible idea to run it directly on your machine.

First, there is the problem statement, which is the problem that you and I pass to the agent. Then there are two main agents involved: the Overseer and the Main Orchestrator.

The Overseer is responsible for asynchronously monitoring the self-improvement process and intervening if the process somehow degenerates. It cannot be modified.

The Main Orchestrator is the one responsible for solving the problem statement.
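
The project's actual Overseer is more sophisticated than this, but the basic shape of "run the orchestrator, watch it asynchronously, intervene if it degenerates" can be sketched with asyncio. All names and the crude time-budget heuristic below are mine, not the repository's:

```python
import asyncio

async def orchestrator(task: str) -> str:
    # Stand-in for the real agent loop, which would call tools, edit files, etc.
    await asyncio.sleep(3)
    return f"solution for: {task}"

async def run_with_overseer(task: str, budget_s: float = 60.0) -> str | None:
    """Run the orchestrator while a watchdog can cut it off.
    'Degenerate' is crudely approximated here by a wall-clock budget."""
    try:
        return await asyncio.wait_for(orchestrator(task), timeout=budget_s)
    except asyncio.TimeoutError:
        return None  # the overseer intervened and killed the run

print(asyncio.run(run_with_overseer("fix the failing unit test")))
```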

Agents are exposed to other agents just as if they were tools, and any agent (with the exception of the Main Orchestrator) can be called by another agent to solve a problem.
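
A rough sketch of what "agents as tools" can look like in practice; the class and method names below are placeholders of my own, not the project's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    # Sub-agents are registered like ordinary tools: a name mapped to a callable.
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def register_agent(self, other: "Agent") -> None:
        """Expose another agent as a callable tool of this agent."""
        self.tools[other.name] = other.run

    def run(self, task: str) -> str:
        # A real agent would let the LLM decide which tool to call;
        # here we only show the wiring.
        if "reason" in task and "reasoning_agent" in self.tools:
            return self.tools["reasoning_agent"](task)
        return f"{self.name} handled: {task}"

reasoning = Agent("reasoning_agent")
developer = Agent("software_developer")
developer.register_agent(reasoning)  # the developer can now call the reasoner
print(developer.run("reason about this stack trace"))
```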

To check whether the agent is becoming better or worse at problem solving, it is tested against a set of benchmarks. The scores provide numerical values for the agent's performance.

Other agents involved:

  - General Problem Solver
  - Software Developer
  - Archive Explorer
  - Reasoning Agent
  - Committee Member

```
                    ┌─────────────────────────────────────────────────────────┐
                    │                         main                             │
                    │              (MainOrchestratorAgent)                     │
                    └─────────────────────────┬───────────────────────────────┘
                                              │
          ┌───────────────────────────────────┼───────────────────────────────────┐
          │                                   │                                   │
          ▼                                   ▼                                   ▼
┌─────────────────────┐           ┌─────────────────────┐           ┌─────────────────────┐
│ general_problem_    │           │ software_developer  │           │ archive_explorer    │
│ solver              │           │ (CodingAgent)       │           │ (ArchiveExplorer)   │
└──────────┬──────────┘           └──────────┬──────────┘           └──────────┬──────────┘
           │                                 │                                 │
           │    ┌────────────────────────────┤                                 │
           │    │                            │                                 │
           ▼    ▼                            ▼                                 ▼
    ┌──────────────┐                 ┌──────────────┐                  ┌──────────────┐
    │ reasoning_   │                 │ reasoning_   │                  │ reasoning_   │
    │ agent        │                 │ agent        │                  │ agent        │
    └──────────────┘                 └──────────────┘                  └──────────────┘
           ▲                                 ▲                                 ▲
           │                                 │                                 │
           └─────────────────────────────────┴─────────────────────────────────┘
    
Via ReviewCommittee tool only (not direct AGENT_CALL):
main, general_problem_solver, software_developer
                    │
                    ▼
          ┌─────────────────────┐
          │ meta_agent_design_  │
          │ reviewer            │
          │ (CommitteeMember)   │
          └─────────────────────┘

```

What kinds of benchmarks are involved?

GSM8KBenchmark — Grade-school math word problems from the GSM8K dataset. Tests multi-step arithmetic in natural language. Requires a single numeric answer; scoring is exact match.

DROPBenchmark — Discrete Reasoning Over Paragraphs. Reading comprehension where questions need numerical or structural reasoning (counting, sorting, etc.) over passages. Uses F1 and EM; answers may have multiple valid forms.

ARCAGIBenchmark — Abstraction and Reasoning Corpus. Pattern completion on integer grids: given input/output examples, infer the rule and produce the correct output grid. Evaluates abstract pattern recognition.

MATHBenchmark — Hendrycks competition math problems (algebra, geometry, etc.) in LaTeX. Supports complex answer types (numbers, fractions, sequences, matrices, intervals) and scores via standardized format and numeric equivalence.
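
To illustrate how such scores end up as plain numbers, here is a rough exact-match scorer in the style of GSM8K. This is my own simplification, not the project's benchmark code:

```python
def exact_match_score(predictions: list[str], references: list[str]) -> float:
    """Fraction of answers that match exactly after light normalisation."""
    def norm(answer: str) -> str:
        return answer.strip().replace(",", "")
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

# A single number like this is what gets compared across agent iterations.
print(exact_match_score(["72", "1,000"], ["72", "100"]))  # 0.5
```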

I should point out that any improvements on the benchmarks are essentially the result of agentic design and prompt engineering, as the underlying LLM itself is not being changed.

The Loop

The loop intertwines two different kinds of runs:

  1. Solving the problem statement
  2. Improving itself at solving the problem statement

The problem-solving runs are straightforward: in these runs, the agent simply tries to solve the stated problem.

The self-improving runs are the interesting ones. During these runs, the agent evaluates the task at hand, its current architecture, and its results on the benchmarks available to it. It then proposes changes to its own architecture and approves or rejects them by leveraging the CommitteeMember, which analyses the proposed changes.
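
In schematic Python, the alternation might look roughly like this. The helper functions are stand-ins I wrote for the project's real components, and the accept/revert logic is my own simplification:

```python
import random

# Stubs that only exist to make the control flow runnable.
def run_benchmarks() -> float: return random.random()
def propose_architecture_change() -> str: return "add a smarter diff-based editor"
def committee_approves(proposal: str) -> bool: return True
def apply_change(proposal: str) -> None: print(f"applying: {proposal}")
def revert_change(proposal: str) -> None: print(f"reverting: {proposal}")

def self_improvement_loop(iterations: int) -> None:
    best_score = run_benchmarks()                 # baseline with the current source
    for _ in range(iterations):
        proposal = propose_architecture_change()  # agent inspects its own code and scores
        if not committee_approves(proposal):      # CommitteeMember review step
            continue
        apply_change(proposal)                    # agent edits its own source code
        score = run_benchmarks()                  # e.g. GSM8K, DROP, ARC, MATH
        if score >= best_score:
            best_score = score                    # keep the change
        else:
            revert_change(proposal)               # discard a regression

self_improvement_loop(iterations=3)
```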

My own experience with this project

Essentially, I can only tell you that I am still experimenting with it. I did try to run it, but I was unable to make it work as it is, and it might need some bugfixes. Errors kept popping up, some of them apparently due to changes in the providers' APIs (or at least that is how it seemed). And even after I fixed these issues and got it running locally, I still was not able to see it make any changes to itself.

My experiments trying to implement it

I did try to make a simple implementation of the same idea. Even though I am still experimenting with it and do not have a functioning version yet, I would like to share what I have learned so far. Since I was about to run out of credits on Claude Code, I decided to stop for now and write this article.

I always try to deploy the simplest model possible for a task, so that costs stay low. Therefore, I tried to use deepseek-coder:6.7b, which I can run locally. It is a capable model when it comes to writing code. Since it is a smaller model with reduced context capabilities, I thought that asking it to break complex tasks into smaller, simpler ones would be a workable solution.
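
For reference, this is roughly how I was driving the local model. The sketch assumes deepseek-coder:6.7b is served by Ollama on its default REST endpoint, and the decomposition prompt is just an example:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def ask(prompt: str, model: str = "deepseek-coder:6.7b") -> str:
    """Send a single prompt to the locally served model and return its reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(OLLAMA_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

# Ask the small model to decompose a task before attempting it.
plan = ask("Break this task into three short, independent steps:\n"
           "Add a retry decorator to every network call in utils.py")
print(plan)
```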

Unfortunately, it did not work. The fundamental problem is that smaller models do not have meta-reasoning capabilities. They are weak at thinking about thinking, so to speak: the agent could see and fix an error, but it was not able to understand 'why did I make this mistake?'. Smaller LLMs (or SLMs, small language models) do not possess this skill, which only starts to appear when models reach around 40B parameters, a size that is only practical on industrial hardware.

Conclusion

Even though I was unable to see it in action, and could not reproduce it myself, it seems to be a fascinating project, which I intend to explore further. For now, we are still responsible for architecting the tools we use, even when using AI. We still write most of the prompts and control the evolution of our code bases, deciding what comes in and what does not.

This strategy opens new territory, where we set a final goal and no longer oversee the agent, but instead let the agent oversee itself and decide how it will be architected.

Sources

Antonio Gulli's book about agentic design patterns — https://amzn.to/3O7f14M

The GitHub repository of the self-improving coding agent — https://github.com/MaximeRobeyns/self_improving_coding_agent