Moving beyond Git — towards a new System of Record for Software Operations

You may have read that GitHub is struggling with scale and that GitLab has to change due to AI. Both stories stem from exponential growth in agentic software delivery, and both are highly relevant to any business adding AI and GPU workloads to Kubernetes platforms.

When we founded ConfigHub, we knew that customers were struggling with the operational complexity of cloud native platforms. DevOps and Kubernetes involve a lot of dev-facing tools which were never designed to be low cost and simple for ops teams.

We diagnosed that one cause of this is how Git — originally a source control tool — is now part of every change in modern software applications. If Git is on the critical path for every single change in software code, configuration, live software operation, compliance and audit, then outages, delays, obfuscation and toil are to be expected.

Now add AI into DevOps. Today we live in a world where Mythos-grade AI means that software can be breached within hours of release and must be patched just as fast. An arms race. What does this do to cost and risk?

Early surveys already show impact (DORA). At ConfigHub we have seen the following:

Operations desks drowning in thousands of issues and expecting 10× growth due to AI
Delivery outages often lasting multiple days due to hard-to-trace YAML file errors
High-value new code for the platform — AI or CVE remediation — that can't be rolled out at all due to continuous breakages

Blast Radius

A large organization runs around 1,000 delivery teams, each making about ten configuration changes a day, or roughly 2.5 million a year across the organization. Each change is rendered across environments, clusters, and regions. The surface these changes can break runs from tens of millions of config values a day into the billions on the largest estates.

That surface is the blast radius, and it is hard to estimate from looking at files in Git. Not all changes break, but enough do. DORA reports one in ten deployments fails, much of it config-related. Teams catch only half to two-thirds before production. A single team lets a handful of config faults reach production a year; across 1,000 teams that is several thousand incidents a year.

The Uptime Institute reports that most significant outages now cost over $100,000, and about one in five cost over $1 million. Across the full volume of incidents — the escalations, blocked rollouts, incident response, and standing ops desk toil — a large organization carries on the order of $300,000 per team in configuration-caused costs annually, before reputational or regulatory consequences.
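The headline figures above can be checked with back-of-the-envelope arithmetic. The sketch below uses only numbers quoted in the text, plus one assumption that the text does not state: roughly 250 working days a year, which is what makes ten changes a day per team come out at about 2.5 million a year.

```python
# Figures quoted in the text.
TEAMS = 1_000
CHANGES_PER_TEAM_PER_DAY = 10
COST_PER_TEAM_USD = 300_000

# Assumption (not stated in the text): about 250 working days a year.
WORKING_DAYS_PER_YEAR = 250


def changes_per_year(teams, per_team_per_day, working_days):
    """Total configuration changes made across the organization in a year."""
    return teams * per_team_per_day * working_days


def annual_config_cost(teams, cost_per_team):
    """Organization-wide configuration-caused cost per year, in USD."""
    return teams * cost_per_team


yearly_changes = changes_per_year(
    TEAMS, CHANGES_PER_TEAM_PER_DAY, WORKING_DAYS_PER_YEAR
)
yearly_cost = annual_config_cost(TEAMS, COST_PER_TEAM_USD)

print(f"{yearly_changes:,} changes a year")
print(f"${yearly_cost:,} a year")
```

Under these assumptions the organization-wide baseline is about $300 million a year, before any AI-driven increase in change volume.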

AI at least doubles the volume of change merged today, and likely more as agents take over authorship. At the same time, DORA 2025 finds that a 25% rise in AI adoption tracks with a 7.2% fall in delivery stability: more change, produced faster, at a higher failure rate, with thinner human review.

This compounds well beyond 2×:

Incidents head toward 15,000 a year (from a few thousand today)
The ops desk balloons from 10,000 tickets to 50,000
Annual costs climb several times over — toward $1 billion and beyond

As one customer put it: "We'll go from 10,000 tickets to 50,000 in no time."

Even ordinary internal IT is at significant risk as the pace of change (and error) increases, because programmers think it clever to reuse software, data, and services, and every reuse connects systems so that a fault in one can spread to the others. And of course Git is the ultimate connected system in DevOps.

Git was designed for humans writing and sharing source code. Operations requires an API that agents can use directly (not PR workflows); visibility into blast radius before change ships (not after rendering in clusters); bounded writes where agents can only touch the configs they control (not re-render all siblings); and proof chains showing who changed what, when, under which policy, with signed evidence. A database-powered operating model, not a filesystem, is what solves this.

It was 20 years ago today… (*21)

On 7th April 2005, Linus Torvalds launched Git, a new system for programmers to share source code. Git is distributed, meaning that code is written and stored in multiple locations. There are many requirements for such a system. Among them, we wish to highlight that software is versioned and every version has a complete development history.

Today we see a growing need to use and support 'Git type' features in the world of software operations as well as development. This is especially true with AI and agentic software delivery where, for example, automated workflows operate faster and at greater scale than before, leading to a need for distributed work, checkpoints, and auditing.

In this document we argue that a new system for software operations management is needed. This can be 'inspired by Git' but instead of tracking versions of source code it must track deployments of software, their variant configurations and the dependency relationships between them.

Why? Because every software operation changes a property of a live system. For such operations to work we must know how a change to one deployed component impacts others. Consider rolling out a patch through dev, staging and production, to a fleet of applications. Or a more complex example: cost optimization.
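To make "how a change to one deployed component impacts others" concrete, here is a minimal sketch of a variant graph and its downstream closure. The variant names and the shape of the graph are hypothetical, chosen only to mirror a patch moving through dev, staging and production; this is not a description of any particular product's data model.

```python
from collections import deque

# Hypothetical variant graph: each key lists the variants derived from it.
DOWNSTREAM = {
    "payments/base": ["payments/dev"],
    "payments/dev": ["payments/staging"],
    "payments/staging": ["payments/prod-eu", "payments/prod-us"],
    "payments/prod-eu": [],
    "payments/prod-us": [],
}


def downstream_closure(graph, start):
    """Return every variant reachable from `start`, in breadth-first order.

    The start variant itself is not included in the result.
    """
    reached = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                reached.append(child)
                queue.append(child)
    return reached


# Which deployments does a patch to the dev variant eventually touch?
impacted = downstream_closure(DOWNSTREAM, "payments/dev")
print(impacted)
```

A store that keeps these relationships as data can answer the impact question with a graph traversal, rather than by reading files and pipelines.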

From development to operations

This journey goes much deeper and further than just AI agents.

How did we get here and what are the requirements for modern operations? What is the difference between source code and 'operated' software? How does change occur, and what processes and data relationships are involved? Can a complete operational history be kept?

Two evolutions should be highlighted: centralization, and configuration management.

Centralization: Linus imagined Git used by builders of open source projects like Linux. But open source was also being used by teams to build websites and startups. They too started using Git and adapting it to their workflows. A team in SF started GitHub, adding a cloud-hosted central store, with a mission to give people tools and get out of their way.

Configuration Management: Git, GitHub and others became 'the place' where developers shared all kinds of technology: packages, notes, docs, and operational artifacts. Most important among these are configuration files which determine how most modern software operates. The growing need for configuration management was highlighted by Google as early as 2016.

By 2018 the use of Git for config was being combined with Kubernetes and containers, and became known as GitOps, an operating model for cloud native workloads. It was understood even then that storing config as files in Git might not be optimal in the long term, yet config files proliferated, leading to "YAML Hell" where ops teams spend days looking for one YAML error.

GitOps vs ClickOps in 2026

Let's take a look at how software operations work by contrasting two different operating models: GitOps and "ClickOps". GitOps has benefits, as set out elsewhere. But our job today is to highlight a few cases where the model creates operational frictions.

"Have you tried turning it off and on again?" has been a mantra of IT ops since the last century. The fact is that when something goes wrong, the time available to fix it may be short, but the solution may not be immediately obvious. Experiments like restarting a component are often part of how we discover which part of a live system isn't working. Most operations teams, if asked, would prefer this level of direct control. Moreover they usually prefer having a control UI — an "ops dashboard" with buttons they can press to toggle configuration settings and reset parts of the system they're responsible for. This operating model is called "ClickOps".

In contrast with ClickOps, the vanilla GitOps model requires all changes to route through Git. From the point of view of a human operator used to ClickOps, they are being asked to make changes indirectly, via another store (e.g., GitHub), instead of directly affecting the deployed config and ops state. Not to forget: Git has its own workflows and approvals (PRs), data organization (files) and transaction model (versions in a Merkle tree), all of which create operational friction points.

As one of the authors will attest, the most common complaint from serious GitOps users is "what happens when I need to make immediate changes directly to the live system?" Users do need an answer to this that can be enacted consistently, immediately and, now, agentically. An ideal operational solution would support 'symmetry': changes made by operators, agents, developers and managers are equally "direct" and use an operational API they understand.

The Operations API

What should an API look like for operations? What does it require?

First, it would operate directly on configuration for the reasons laid out above: to support both 'direct' ClickOps and the wider set of user personae, including machines. For the same reason, it should add value to existing GitOps. For example, Git's existing 'API' isn't suited to finding and upserting multiple values, and an operations API could interact with GitOps to add reverse flows from live updates.
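As an illustration of "finding and upserting multiple values", the sketch below treats configuration as data in an in-memory store and updates one value across a selected set of configs in a single call. The function names (`get_path`, `set_path`, `bulk_upsert`), the dotted-path convention and the example configs are all hypothetical, not an existing API.

```python
def get_path(config, path):
    """Return the value at a dotted path, or None if any key is missing."""
    node = config
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def set_path(config, path, value):
    """Set the value at a dotted path, creating intermediate dicts as needed."""
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def bulk_upsert(store, path, value, selector):
    """Upsert `value` at `path` in every config the selector accepts.

    Returns the names of the configs that actually changed.
    """
    changed = []
    for name, config in store.items():
        if selector(name, config) and get_path(config, path) != value:
            set_path(config, path, value)
            changed.append(name)
    return changed


# Hypothetical store of rendered configs, keyed by variant name.
store = {
    "web/dev": {"image": {"tag": "1.4.0"}, "env": "dev"},
    "web/staging": {"image": {"tag": "1.4.0"}, "env": "staging"},
    "web/prod": {"image": {"tag": "1.3.9"}, "env": "prod"},
}

# Patch every non-production variant in one operation.
changed = bulk_upsert(
    store, "image.tag", "1.4.1", lambda name, cfg: cfg["env"] != "prod"
)
print(changed)
```

The same operation against files in Git would typically mean locating each file, editing it, and opening one or more PRs.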

All that means our API must create, read, update and delete deployment Variants, understanding changes like promotion, replication and upgrade, and relationships like upstream, downstream, provenance, and closure. And to be reliable in automated settings with AI tools, this API should be able to provide evidence of change: signed, authenticated receipts.
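One way to picture a signed receipt is a small record of who changed which variant, with digests of the config before and after, authenticated by a signature. The sketch below uses an HMAC from the Python standard library purely for brevity. The field names and the key are hypothetical, and a shared-secret HMAC only proves that a key holder produced the receipt; real non-repudiation would call for asymmetric signatures.

```python
import hashlib
import hmac
import json

# Hypothetical signing key; a real system would use managed key material.
SIGNING_KEY = b"demo-signing-key"


def digest(config):
    """Stable SHA-256 digest of a config, using sorted-key JSON."""
    encoded = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def make_receipt(actor, variant, before, after, key):
    """Build a receipt for one change and sign it with an HMAC."""
    body = {
        "actor": actor,
        "variant": variant,
        "before_sha256": digest(before),
        "after_sha256": digest(after),
    }
    payload = json.dumps(body, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {**body, "signature": signature}


def verify_receipt(receipt, key):
    """Recompute the HMAC over the receipt body and compare in constant time."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])


receipt = make_receipt(
    actor="agent-42",
    variant="web/prod-eu",
    before={"replicas": 3},
    after={"replicas": 5},
    key=SIGNING_KEY,
)
print(verify_receipt(receipt, SIGNING_KEY))
```

Any edit to the receipt after signing, such as changing the actor, makes verification fail.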

Helm is not an API

For many the "Operations API" has become "Update Helm, wait for CI, and hope that everything renders in the cluster". Or, replace Helm with another IaC tool like Terraform. This is broken.

Real systems consist of multiple variant deployments as stated above. Each variant has a unique configuration. That config is written by developers, using tools and language formats ideal for development via Git. Often that tool is a template language such as Helm. While Helm has been a huge success for packaging and people need this, it can complicate operations.

For operations staff, a 2,000-line Helm chart is not easily understood, if it can be located in the first place. Unlike ClickOps, where we can open a UI and update nodes directly, GitOps with Helm or IaC offers no immediate trace of the form "this output came from this input". Without such a link back we have no provenance, no root cause analysis and no direct bug fix.

The problem underneath is structural. Helm, like every "as code" tool, renders many configs from one source file. But an operation has to change the configuration of one specific deployment — this variant, in this region, now. The only editable surface is the shared source, and changing it re-renders every config downstream. So a fix to one variant cannot be contained: it carries the blast radius of all its siblings.
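The structural problem can be shown with a toy renderer. The sketch below uses Python's `string.Template` as a stand-in for a real template engine such as Helm. The template, the shared values and the region names are hypothetical.

```python
from string import Template

# Stand-in for a chart template: one source, many rendered configs.
TEMPLATE = Template("region: $region\nreplicas: $replicas\n")

# Hypothetical deployment targets that share the same source values.
TARGETS = ["eu-west", "us-east", "ap-south"]


def render_all(shared_values, targets):
    """Render one config per target from the single shared source."""
    return {
        target: TEMPLATE.substitute(region=target, **shared_values)
        for target in targets
    }


before = render_all({"replicas": 3}, TARGETS)

# The operator only wants more replicas in eu-west, but the only
# editable surface is the shared value, so every sibling re-renders.
after = render_all({"replicas": 5}, TARGETS)

changed = [t for t in TARGETS if before[t] != after[t]]
print(changed)
```

Real charts allow per-target overrides, but each override is another edit to shared source that must be re-rendered and reviewed. The sketch only shows the default behaviour the text describes.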

From Fan-Out to Blast Radius

The combination of files and templates is dangerous:

Fan-out is how many live targets a single source expands to: one definition rendered across environments, clusters, regions and tenants
Density is how many configuration values sit inside each rendered component
Blast Radius = Fan-out × Density, a key risk factor because it counts the live values a change can touch rather than the edits a human made
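The product above can be computed directly once rendered configuration is available as data. In this minimal sketch, density is taken to be the number of leaf values in a rendered component, and every target is assumed to render a component of the same size. The component and the target list are hypothetical.

```python
def density(config):
    """Count the leaf values in a nested config of dicts and lists."""
    if isinstance(config, dict):
        return sum(density(value) for value in config.values())
    if isinstance(config, list):
        return sum(density(item) for item in config)
    return 1


def blast_radius(rendered_component, targets):
    """Fan-out times density, assuming each target renders the same shape."""
    fan_out = len(targets)
    return fan_out * density(rendered_component)


# Hypothetical rendered component with six leaf values.
component = {
    "image": "app:1.2",
    "replicas": 3,
    "resources": {"cpu": "500m", "memory": "256Mi"},
    "env": ["LOG_LEVEL=info", "REGION=eu"],
}

# Hypothetical fan-out: 3 environments x 4 regions = 12 live targets.
targets = [
    f"{env}/{region}"
    for env in ("dev", "staging", "prod")
    for region in ("eu", "us", "ap", "sa")
]

print(blast_radius(component, targets))  # 12 targets x 6 values = 72
```

If targets differ in size, summing the density of each rendered target would be the natural generalization.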

This is also why the file-based model cannot contain a fix: it only ever exposes the shared source, while the thing that actually breaks is the rendered fan-out it cannot see. A store that holds the rendered output configuration as data can answer the four questions operators really have: What is the blast radius of this change before it ships? Can we review the rendered result across every target, not just the template? Can we roll back to a known good fleet state rather than a source commit? And can we show what changed, where, and by whom as signed evidence?
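Reviewing "the rendered result across every target" amounts to diffing two fleet states held as data. The sketch below is one hedged way to do that. The function names and the two-target fleet are hypothetical, and a missing value is treated the same as `None`.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts into {dotted.path: value} pairs."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def fleet_diff(before, after):
    """Report, per target, each path whose value differs as (old, new)."""
    report = {}
    for target in sorted(set(before) | set(after)):
        old = flatten(before.get(target, {}))
        new = flatten(after.get(target, {}))
        changes = {
            path: (old.get(path), new.get(path))
            for path in sorted(set(old) | set(new))
            if old.get(path) != new.get(path)
        }
        if changes:
            report[target] = changes
    return report


# Hypothetical fleet state before and after a proposed change.
before = {
    "prod-eu": {"replicas": 3, "resources": {"cpu": "500m"}},
    "prod-us": {"replicas": 3, "resources": {"cpu": "500m"}},
}
after = {
    "prod-eu": {"replicas": 5, "resources": {"cpu": "500m"}},
    "prod-us": {"replicas": 3, "resources": {"cpu": "500m"}},
}

print(fleet_diff(before, after))
```

Counting the entries in the report gives the values a specific change would touch before it ships, and a stored `before` state doubles as the rollback target.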

For this reason, advanced DevOps experts recommend the rendered manifest pattern and gitless GitOps. In practice, implementing these requires building a kind of database workflow out of the wrong tools: templates, pipelines, files, filesystems, and the Git protocol. Updates require generation of new configs from source, turning operations into guesswork with high blast radius.

Database vs Filesystem

Our store should be API-first, AI-native and on-premises. Why can't we just implement this in Git, or "on Git", or "with Git", and attach bells and whistles like OCI gateways, signing and authentication? To some extent these are adoption questions: users will want tools that work with Git and OCI at key points. But we are separating config lifecycle from source code, just like OCI container management split from Git.

All changes should exist both in a non-repudiable revision tree and in a bidirectional variant graph. That gets us away from one-way flows and from the loss of information caused by generating config from source code and rendering it in pipelines and live clusters. Concurrent reconciliation must be possible: consider that in many systems, multiple reconciliations are running in parallel.
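A heavily simplified picture of such a revision store is a hash-chained log with an optimistic concurrency check, so that two reconcilers running in parallel cannot silently overwrite each other. The `RevisionLog` class below is hypothetical. It models a single linear chain rather than a full tree, and hash chaining gives tamper evidence only; non-repudiation would additionally need signatures.

```python
import copy
import hashlib
import json


class RevisionLog:
    """Append-only, hash-chained revision history for one config."""

    def __init__(self):
        self.revisions = []

    def head(self):
        """Hash of the latest revision, or None if the log is empty."""
        return self.revisions[-1]["hash"] if self.revisions else None

    @staticmethod
    def _hash(parent, config):
        body = {"parent": parent, "config": config}
        return hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()

    def append(self, config, parent):
        """Commit a new revision only if `parent` is still the head."""
        if parent != self.head():
            raise ValueError("stale parent: another writer committed first")
        revision = {
            "parent": parent,
            "config": copy.deepcopy(config),
            "hash": self._hash(parent, config),
        }
        self.revisions.append(revision)
        return revision["hash"]

    def verify(self):
        """Check that every revision links to its parent and hashes correctly."""
        parent = None
        for rev in self.revisions:
            if rev["parent"] != parent:
                return False
            if rev["hash"] != self._hash(rev["parent"], rev["config"]):
                return False
            parent = rev["hash"]
        return True


log = RevisionLog()
first = log.append({"replicas": 3}, parent=None)
second = log.append({"replicas": 5}, parent=first)

# A second reconciler that still believes `first` is the head is rejected.
try:
    log.append({"replicas": 4}, parent=first)
    conflict_detected = False
except ValueError:
    conflict_detected = True

print(conflict_detected, log.verify())
```

The rejected writer has to re-read the head and reapply its change, which is the behaviour a database transaction would provide out of the box.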

At ConfigHub we are working on this. We see the product as the authoritative store for operational configuration and the act of changing it. It owns the desired state, the apply path, the bounded write, the attribution-and-receipts layer, and the reverse reconciliation from live back to intent. It answers the operator's and the agent's real questions: what will this change do to the running fleet, who is allowed to make it, and prove it before it lands. Other systems index a graph of what is. This one owns the write that changes it.
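The idea of a bounded write can be sketched in a few lines: each actor is granted a set of variants, a write outside that set is refused, and a write inside it touches only the named variant. The `BoundedStore` class, the grant model and the names below are hypothetical and are not ConfigHub's actual API.

```python
import copy


class BoundedStore:
    """Config store where each actor may only write to granted variants."""

    def __init__(self, configs, grants):
        # Copy the input so callers cannot mutate the store from outside.
        self.configs = copy.deepcopy(configs)
        # grants maps an actor name to the set of variants it may change.
        self.grants = grants

    def write(self, actor, variant, key, value):
        """Apply one change if permitted and return an attribution record."""
        if variant not in self.grants.get(actor, set()):
            raise PermissionError(f"{actor} may not write to {variant}")
        old = self.configs[variant].get(key)
        self.configs[variant][key] = value
        return {
            "actor": actor,
            "variant": variant,
            "key": key,
            "old": old,
            "new": value,
        }


store = BoundedStore(
    configs={
        "api/prod-eu": {"replicas": 3},
        "api/prod-us": {"replicas": 3},
    },
    grants={"agent-eu": {"api/prod-eu"}},
)

# Permitted: the agent changes the one variant it controls.
record = store.write("agent-eu", "api/prod-eu", "replicas", 5)

# Refused: the sibling variant is outside the agent's grant.
try:
    store.write("agent-eu", "api/prod-us", "replicas", 5)
    refused = False
except PermissionError:
    refused = True

print(record, refused)
```

Because the write targets one variant's data rather than a shared template, the sibling in the other region is left untouched.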

Software Operations in the Agentic Era

In this document, we have not built our story around agents per se. This is because most of the changes agents are bringing have been clear since Git became central to operations. What agents more than anything will do is accelerate change, acting as a forcing function against a backdrop of crises.

AI agents are coming to Kubernetes configuration. They will bring velocity and automated compliance, plus a wider range of deployment models: new GPU-based stack configs, single and multi-tenant sandboxes, automated testing, prompting and context. This means that AI will create many more apps and workflows, leading to even more deployed variants. The API and associated machinery must scale, potentially extending to legacy stores (via protocol). Verification and audit must be built in via signed receipts and proofs of completion.
