We've made a lot of progress in improving how we test software over the past twenty years, but data has always lagged behind. According to Harvard Business Review, only 3% of companies' data meets basic data quality standards. Based on my long experience working with hundreds of companies, big and small, this feels depressingly accurate. The real-world impact of this on companies is grossly underestimated and underappreciated. We have to do better, and we can.
I'm going to take you through how we can apply the lessons learned from software testing to data and how to approach data quality from the ground up.
Foundations of Quality
The first thing we need to think about is: what is quality? A common misconception is that testing and quality are the same thing. They are not.
Quality is a property of the system or thing we care about. Testing is how we assert that whatever we have provides an acceptable degree of the quality we are looking for. In other words, quality is the thing we want and testing is how we check whether we have it.
A structured approach to quality involves talking to a lot of people and finding out what quality means to them and what they need from, in our case, the data in order for it to be useful to them.
A great tool I have used for a long time to capture this is something called a Quality Strategy. A Quality Strategy (QS) can be created for anything where you want to define and assure quality, but for data engineering specifically we are talking about the typical ETL (Extract, Transform, Load) or, better still, ELT workflow most companies run through their data processing pipelines.
What is a Quality Strategy?
A QS is a document that defines a broad framework for:
- What qualities or properties we want to verify
- Why we care about the particular quality dimension
- How and where we are going to verify it
- When we are going to verify it
- Who is responsible for doing it
It can, and should, be lightweight and easy to understand. This is the foundation for defining what quality means in general for data engineering workloads.
The QS itself should only contain the skeleton of what will become individual test strategies. This keeps the QS from becoming too large and cumbersome; it merely provides the framework from which pipeline-specific test strategies are derived. As a side note, you don't have to have a higher-level QS. You might get away with just a project- or pipeline-level one, but because this is a new concept for many, I have found that having a templated version makes producing the specific one you need at the lower level easier and more consistent across teams.
It is important to understand that while many qualities are fairly universal, each individual data pipeline is likely to have its own differences and specific requirements. One example is data needed for real-time analytics: this kind of data often has a latency requirement where data coming from a source is needed in near real time, which is very different from the latency acceptable for a system that only receives updates once a week. Similarly, correctness, which is a universal need in data, may have different requirements. The data for your bank account balance or the salaries in a payroll system generally needs to be 100% correct. Other data, while we would love 100% correctness, may not be as critical. An example could be data from surveys: we know not everyone is going to be completely accurate in their responses, but even if it is only 80% reliable it is still better than no data and is probably good enough.
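To make this concrete, here is a minimal sketch of how pipeline-specific thresholds for the same quality dimension might be captured in code. The pipeline names, threshold values and the `passes_correctness_threshold` helper are illustrative assumptions rather than a prescribed structure.

```python
# Illustrative only: the same quality dimension (correctness) with different
# thresholds per pipeline. Names and values are hypothetical.
CORRECTNESS_THRESHOLDS = {
    "payroll_salaries": 1.00,    # must be 100% correct
    "account_balances": 1.00,    # must be 100% correct
    "customer_surveys": 0.80,    # 80% reliable is still useful
}

def passes_correctness_threshold(pipeline: str, valid_rows: int, total_rows: int) -> bool:
    """Return True if the observed share of valid rows meets this pipeline's threshold."""
    if total_rows == 0:
        return False  # no data cannot satisfy any correctness requirement
    return valid_rows / total_rows >= CORRECTNESS_THRESHOLDS[pipeline]

# 95% valid rows is good enough for surveys, but not for payroll.
assert passes_correctness_threshold("customer_surveys", 95, 100)
assert not passes_correctness_threshold("payroll_salaries", 95, 100)
```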
Test Strategy (derived from the QS)
The Test Strategy (TS) for a given workload or pipeline will have a similar content structure to the QS, but with a lot more context and detail. It will contain the specific requirements for that workload: which tools, which teams, key contacts, and generally how testing for that workload will be done. When structuring our tests overall, we can start to borrow a lot of concepts from software engineering testing.
The Test Pyramid

The classical test pyramid used in software engineering depicts the idea that we have different types of tests that differ in volume and in the time they take. Typically, unit tests are very small and fast to run and there are a lot of them. These tests tend to get run frequently, and in order not to impact engineers' productivity, they need to complete quickly, ideally in seconds. Other tests, like integration tests (between components, for example), might take a bit longer or have environmental dependencies that are not always quick to satisfy. End-to-end tests check across the whole system and typically involve simulating full workflows that may be quite slow to run. The relevance to our test strategies is that this needs to be taken into account when designing them, so that the right tests run at the right time. Generally speaking, in software engineering most of these tests occur before a production deploy, but some may run in production too.
This can be applied to data engineering workflows, where some of the tests we run, for example on the correct implementation of a metric, are essentially unit tests and will run very quickly, but other tests we may need, especially when considering large data volumes, may take a long time to run. In data, unlike software, we generally don't have direct control over data validation, so production testing, aka observability, is a critical necessity. We will get into the specific test types we commonly need in data engineering later on.
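To make the bottom of the pyramid concrete, here is a minimal pytest-style sketch of a metric implementation being unit tested in isolation. The `gross_margin` function and its expected values are hypothetical, and nothing touches a warehouse, so a suite like this can run on every commit in milliseconds.

```python
import pytest

# Hypothetical metric implementation: pure logic, no data platform required.
def gross_margin(revenue: float, cost_of_goods_sold: float) -> float:
    """Gross margin expressed as a fraction of revenue."""
    if revenue == 0:
        raise ValueError("revenue must be non-zero")
    return (revenue - cost_of_goods_sold) / revenue

def test_gross_margin_basic():
    assert gross_margin(100.0, 60.0) == pytest.approx(0.4)

def test_gross_margin_zero_revenue_rejected():
    with pytest.raises(ValueError):
        gross_margin(0.0, 10.0)
```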
Another benefit of this approach is that you can expose the higher-order tests to end consumers without exposing them to all the lower-level, intermediate testing that is useful for data engineers building the pipelines. It goes beyond the scope of this article, but a domain-driven design approach, which this idea supports, hides complexity and has many benefits, one of which is reducing noise and cognitive overload by only exposing what is needed. This just makes it easier for people to focus on what is important to them.
Implement automated tests and monitor
Once we have defined our overarching quality strategy and created a test strategy for a given workload, we get to the fun stuff — actually implementing the tests themselves.
So far, we have talked at a high level about defining quality and testing, but we obviously need to consider the actual implementation approach as well.
From here on in, we are assuming an agile, DevOps style of implementation using CI/CD pipelines that focuses on as much automation as possible with a high degree of actively managed test coverage.
Similarities between software and data engineering
There are a lot of similarities between testing software and testing in data engineering pipelines.
- If you can think of a test you need, it can almost always be automated.
- The same techniques and a lot of the common testing tools apply. QS and TS concepts are the same and a lot of open source tooling for software can be applied easily to data problems.
- The same kinds of benefits from fast feedback loops apply, and the ratio of new work to fixing bugs can be drastically improved. Just these two benefits (and there are more) create happier end users and happier engineers, improve quality drastically, reduce risk and create a significant net increase in productivity and efficiency for engineering teams. In many cases I have seen teams go from spending 80% of their capacity on fixing bugs and operational issues to less than 5%. It takes discipline and effort to get there, but it makes a huge difference.
Differences between software and data engineering relevant for testing
- Direct input validation of data does not exist the way it does in a typical software application. Surprises happen mostly at run time (in production), not at build/design time
- The usage context of data is NOT implied, unlike in an application that has its own controls over who can see and do what with its data. This leads to complexities around people's willingness to share data when they don't know or trust what will be done with it.
- Comprehensive metadata is critically important in order to create the controls that an application would otherwise enforce. Most software engineering neglects this as it is implied by the application itself.
- Data can be big, really big, and testing can be slow and expensive if not applied mindfully. This can also be a problem in software engineering in some cases, but it is far less common than in data engineering.
- Data coming in from the outside world in production will inevitably have issues. In software engineering, where we typically have control of data validation, we break the build on a failing test. When a pipeline runs, what to do about data issues is more complex. If correctness of data is paramount, we may indeed abort the pipeline and create an incident, but in other cases we may just want to quarantine the questionable data, or allow the pipeline to run but notify someone that it needs remediation or at least investigation, and update our catalog or data quality tool accordingly (see the sketch after this list).
- The general categories of things we test for are not identical. They are generally simpler in data engineering, but some are fairly specific to data; metadata validation is one example.
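As a sketch of the quarantine-or-notify idea mentioned in the list above, the snippet below routes failed rows according to a severity setting. The severity levels, the `notify` helper and the exception type are assumptions for illustration; in a real pipeline these would hook into your orchestrator, a quarantine table and your alerting or catalog tooling.

```python
from enum import Enum

class Severity(Enum):
    ABORT = "abort"            # correctness is paramount: stop the load, raise an incident
    QUARANTINE = "quarantine"  # set bad rows aside, let the rest of the load continue
    WARN = "warn"              # load everything, but flag it for investigation

class DataQualityError(Exception):
    pass

def notify(message: str) -> None:
    # Placeholder: in practice this would alert someone and update a catalog/DQ tool.
    print(f"[data-quality] {message}")

def handle_failed_rows(bad_rows: list[dict], severity: Severity) -> list[dict]:
    """Return the rows to quarantine; abort or just warn depending on severity."""
    if not bad_rows:
        return []
    if severity is Severity.ABORT:
        raise DataQualityError(f"{len(bad_rows)} rows failed validation; aborting load")
    if severity is Severity.QUARANTINE:
        notify(f"{len(bad_rows)} rows quarantined for remediation")
        return bad_rows
    notify(f"{len(bad_rows)} suspect rows loaded; investigation needed")
    return []
```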
Challenges with data testing and traps to avoid
In 2025, software engineering-style testing of data is still relatively rare in most organisations. As a result, it is easy to make mistakes when attempting it for the first time that make it harder than it needs to be, and can even lead to total failure.
Social challenges within the Data team itself
Many enterprise data teams essentially do no testing at all today. The time cost of including it in a team's workflow, plus the learning curve, can feel like adding a lot of time and cost to engineering work, impacting committed plans and timeframes. While the learning curve may add extra time up front, it is a false economy not to do it.
The cost of repairing and reloading data down the line outweighs the initial cost by orders of magnitude. This is self-evident to any team that has implemented a good, disciplined automated testing approach, and it is now generally accepted wisdom in software (though still not universally applied). I was around when CI/CD-style test automation was new in software development, and the same arguments were made back then as now. The big advantage we have this time is that data testing is a relatively simple affair compared to software testing, and the tooling and test frameworks are quite good compared to the mid 2000s, when CI tooling was brand new and very basic and automated testing was a hotly debated topic.
Organisational and Mindset Inertia
Beyond the data engineering team, most enterprises have a lot of policies and procedures for how data should be managed and controlled. Data governance, security, development and release processes have evolved over time to handle the existing data landscape. Even though such policies tend to be at best partially effective in practice, getting them changed to enable a true DevOps model can be extremely difficult.
In the application world, it can be hard enough when tight coupling to legacy applications makes automated integration and acceptance testing difficult. In the world of data, security is a constant concern, and we are all increasingly subject to some form of regulation in how we handle data such as PII. This is an order of magnitude more scrutinised in regulated industries, where the list of stakeholders who take an interest in how data is handled extends broadly.
The way most enterprises go about managing their data hasn't evolved much since the 1990s. Some of the tools may be more modern and may support new ways of working, but I see a lot of organisations applying their current ways of working and thinking to these new products.
Shifting an organisation's mindset and then changing processes and policies across a wide group of departments is objectively hard. It's almost always the single hardest task in moving to an automated approach executed as DevOps was intended. It's part of why recognising and including the social aspect of not just the data team but the wider business is a core element in solving the bigger challenge of data quality management. In Part 2 I will talk about some ways I have been able to do this successfully in various companies.
Speed Matters
I have seen teams without a good test strategy lump all testing together and complain that it slows down their development and generates excessive costs, because expensive, long-running tests are run more often than needed. Certain commercial data quality tools exacerbate this problem by purporting to be the only thing you need (more on this in Part 2). Test the right thing in the right step of a pipeline and optimise for engineer workflow efficiency, and the ROI on test runs becomes much easier to manage. There is also a high degree of correlation between speed and cost: assuming your code is at least reasonably efficient, faster is cheaper.
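One simple way to avoid lumping everything together is to tag tests so the inner development loop only runs the quick ones. Below is a minimal sketch using pytest markers; the metric function and the slow test body are placeholders I have invented for illustration. Running `pytest -m "not slow"` locally and the full suite on a schedule keeps engineer feedback fast while still getting the expensive coverage (the `slow` marker would normally be registered in pytest.ini to avoid warnings).

```python
import time
import pytest

# Hypothetical metric implementation, present only so the fast test has something to check.
def compute_discount_rate(order_value: float) -> float:
    return 0.1 if order_value >= 100.0 else 0.0

# Fast check: pure function, runs on every commit in milliseconds.
def test_discount_rate_bounds():
    assert 0.0 <= compute_discount_rate(order_value=120.0) <= 1.0

# Expensive check: stands in for a full-table scan against the warehouse,
# marked so it only runs nightly or pre-release.
@pytest.mark.slow
def test_full_history_scan_stand_in():
    time.sleep(2)  # placeholder for a long-running warehouse query
    assert True
```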
Language Limitations for Testing
SQL and Python are the main languages used by data and analytics engineers, ML engineers and data scientists. SQL has a fairly limited set of open source testing tools (though some exist). Python, on the other hand, has great support for testing and integrates well with many data environments. It's important to note that any test framework in any language can be used, provided it can access data in your platform.
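As an example of "any framework that can reach your data", here is a pytest sketch that asserts simple constraints via SQL. It uses an in-memory SQLite database purely so the example is self-contained; in practice the fixture would return a connection to your warehouse through that platform's own driver, and the table and column names here are invented.

```python
import sqlite3
import pytest

@pytest.fixture
def conn():
    # Stand-in for "connect to the analytics platform"; swap in your warehouse driver here.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")
    connection.execute("INSERT INTO orders VALUES ('A-1', 10.0), ('A-2', 25.5)")
    yield connection
    connection.close()

def test_order_ids_are_never_null(conn):
    nulls = conn.execute("SELECT COUNT(*) FROM orders WHERE order_id IS NULL").fetchone()[0]
    assert nulls == 0

def test_amounts_are_positive(conn):
    bad = conn.execute("SELECT COUNT(*) FROM orders WHERE amount <= 0").fetchone()[0]
    assert bad == 0
```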
"The Business" is involved
Non-technical business users and analysts are drawn deeper into the solution than with typical software engineering testing. People interacting with data directly need a semantic understanding of the data they are given. A good catalog with accurate and up-to-date quality metrics is mandatory for usability and understanding.
Silver bullet thinking
There isn't really a way to put this delicately. The reality of data engineering over the years is that most engineers aren't really engineers in the sense of creating modular pipelines, writing code and tests, applying proper version control to both data and code, and generally following the kind of workflow any competent software engineering team would follow. Before I start getting hate mail for saying this, there is a long-running argument that a lot of software engineers aren't "real" engineers either. From the perspective of other engineering disciplines (mechanical, electrical etc.), there is often a perception that software folks appropriating the "engineer" title is akin to homeopathy practitioners calling themselves "scientists". So take this with a grain of salt.
Here goes. Before "Data Engineer" was a title, most people who did this line of work were generally known as "ETL developers" and almost exclusively used no/low-code ETL tools such as Microsoft's SSIS (basically the same thing as Azure Data Factory without a subscription model) or Informatica's ETL offerings. Until the "Big Data" revolution forced a few software engineers into data, almost everything was "click ops", and traditional vendor offerings in the data space centre around tools that "do everything". Successful solutions don't work like this in practice, and no one vendor, no matter how good their product is, does everything you need. In the data quality domain we need a more nuanced approach: there are commercial tools that are excellent, but without a scalable and affordable licensing or cost model, the ability to make effective use of them broadly becomes an issue. A lot of commercial tools in widespread use are ineffective and promote bad practice, as well as being very expensive. The good ones generally focus on solving one type of problem well.
Capability and Capacity
A software engineering-centric approach to testing necessarily involves a new skill set that most enterprises are very short of in their data teams. Before embarking on a wholesale shift to a new way of working, you need to have people in place who know what good looks like. Fortunately, people do exist who know what a proper DevOps implementation looks like and have experience implementing one. If no one knows what good looks like before you begin, expect the learning curve to be longer and more mistakes to be made.
I'm going to stop here, as we have covered most of the theory behind a better way to approach data testing. Part 2 will dive into how you can practically go about doing this, with a few worked scenario examples.
Wrapping Up
As a quick recap:
- We need to start by thinking about quality and defining it so we can design the right approach to testing
- Testing data in data engineering pipelines has many similarities to software engineering and some important differences
- If you're looking to change the way you test data, it's a bigger problem than just a technical one. Arguably, of the usual people, process and technology buckets we group things into, the people one is the hardest to solve and the key to success.
- As an industry we have proven approaches to leverage and the knowledge exists as to what good looks like. It's not super common yet, but it's there.
- In Part 2 I will talk more about how to apply this sort of approach in a typical larger environment where most of this is not in place and what has worked for me and others to move in a better direction.