Summary:

AI agents can now build test frameworks 5–10x faster than traditional methods — but this speed amplifies both excellence and incompetence.

Poorly architected automation wastes more time than it saves, and an agentic coding flow amplifies the two main risks: (1) false positive tests that pass when they shouldn't, creating dangerous false confidence at scale, and (2) accumulated technical debt, where agent mistakes compound into unmaintainable frameworks within months.

SDET expertise isn't becoming obsolete — it's the primary defense against both catastrophic automation failures and framework collapse.

This article shows you how to harness agentic test automation while avoiding its pitfalls.

Part I: Context

Recently, I partnered with GitHub Copilot using the Model Context Protocol (MCP) to build a production-grade Playwright BDD framework with C# (any language works, I just love C#) and SpecFlow. The agent wrote 100% of the code. I focused entirely on testing each output and iteratively prompting for the next step.

Result: We built in 16–20 hours what traditionally takes 4–6 weeks. It works and it scales. There is no faster way to add test coverage.

The agentic approach to test automation — where AI agents partner with human engineers in an iterative "test the output, prompt the next step" cycle — represents a fundamental shift. Done right, it's transformative. Done wrong, it's catastrophic.

Two systemic risks emerge:

1. False positive tests: Research shows AI agents systematically exploit test cases to report success incorrectly, making false positives an inherent systemic risk that will never self-correct. [1]

2. Accumulated technical debt: AI's speed enables a "just add more code" mentality in which small mistakes get "fixed" by layering on more complicated workarounds and divergent patterns. Unchecked, this tendency compounds into unmaintainable frameworks within months, or even weeks.

Bottom line:

Your "green lights" (passing tests) are only as meaningful as the assertions backing them up AND the architectural discipline maintaining them.

The stakes are higher. The risks are real. And SDET skills have never been more valuable.

Part II: Setup Guide

Prerequisites

  • GitHub Copilot subscription — As of November 2025, GitHub offers a free 30-day trial at github.com/github-copilot/pro
  • Visual Studio Code (or Visual Studio)
  • Basic understanding of test automation best practices (assertions, test independence, maintainability)

Step 1: Install MCP Server

This guide focuses on GitHub Copilot, but MCP works with other AI tools too. For setup with Claude Desktop or Cursor IDE, see reference [2].

1. Open VS Code Command Palette: Ctrl + Shift + P (Windows/Linux) or Cmd + Shift + P (Mac)

2. Type: >MCP: Browse Servers and select it

3. This opens the MCP Server Registry (current URL is https://code.visualstudio.com/mcp)

4. Install your preferred server(s):

  • Playwright MCP Server — For Playwright automation projects
  • GitHub MCP Server — For general automation and API testing
  • Both — If you want flexibility (you can switch between them by configuring your agent's tools, but don't enable both at once: agent mode caps out at 128 tools)
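
If you prefer to wire a server up by hand instead of using the registry UI, it lands in .vscode/mcp.json. A minimal sketch (the exact schema and package version depend on your VS Code release):

```json
{
  "servers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```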

Step 2: Create Your copilot-instructions.md File

This is the most important file in your project. GitHub Copilot reads this file automatically and treats its contents as operating rules for all subsequent prompts in the repository.

First, ask your agent to create it:

Open GitHub Copilot chat, select Agent mode, and prompt:

"Create a .github/copilot-instructions.md file with test automation best practices. Focus on self-validating test steps, proper assertions, smart waits, and maintainable code patterns."

This is where you should add everything that keeps the agent on track. Does your agent keep trying to refactor things with PowerShell commands instead of making code edits directly? Don't just tell the agent to stop doing that; tell it to add a rule against it to copilot-instructions.md so the rule persists!
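
For reference, a trimmed-down version of such a file might look like the sketch below. These rules are illustrative, not what Copilot will literally generate:

```markdown
# Copilot Instructions

## Test code rules
- Every test step must self-validate: assert element state before interacting.
- Prefer user-facing locators (GetByRole, GetByLabel) over raw CSS/XPath.
- Rely on Playwright's auto-waiting; never add hard-coded sleeps.
- Keep step definitions small and reusable; no duplicated logic across steps.

## Workflow rules
- Make code edits directly in files; do not refactor via PowerShell commands.
- Work on one issue at a time and stop for review after each change.
```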

Step 3: Build your Framework using this Iterative Development Loop

You: "Create a test automation framework with Playwright and TypeScript using Chrome browser to test gmail.com"

Agent: [Generates project structure, package.json, playwright.config.ts]

You: [Test it — does npm install work? Does npx playwright test run? Commit changes or at least stage them before prompting for the next feature]

You: "Add a Page Object Model for the Gmail login page"

Agent: [Creates LoginPage.ts with locators and methods]

You: [Review the code — are locators using best practices? Test the page object, stage changes if good, ask for fixes otherwise]

You: "Create a test that navigates to gmail.com then allows the user up to 2 minutes to manually login and captures the auth settings in a way other tests can use it"

Agent: [Generates test file with manual auth capture]

You: [Run the test — does it wait correctly? Does it save auth state? Can subsequent tests reuse it? Stage changes again or ask for fixes]
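
A sketch of what the agent might produce for that auth-capture test (the URL pattern and file path here are assumptions, not guaranteed output):

```typescript
import { test } from '@playwright/test';

// One-time setup test: gives the user up to 2 minutes to log in manually,
// then saves the authenticated browser state for other tests to reuse.
test('capture auth state via manual login', async ({ page, context }) => {
  test.setTimeout(150_000); // headroom beyond the 2-minute login window
  await page.goto('https://gmail.com');
  // Wait until the inbox URL appears, i.e. the manual login has finished.
  await page.waitForURL('**/mail/**', { timeout: 120_000 });
  await context.storageState({ path: '.auth/user.json' });
});
```

Subsequent tests can then opt in with test.use({ storageState: '.auth/user.json' }).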

You: "Add screenshot and video capture on test failure, and generate a test report after each run"

Agent: [Updates playwright config with screenshot settings, video settings, and report generation]

You: [Trigger a failure — confirm screenshot and video are captured, viewable from generated test report. Prompt for fixes until it works, then commit changes]
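
The resulting playwright.config.ts changes would look something like this (the option values shown are one reasonable choice, not the only one):

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    screenshot: 'only-on-failure', // attach a screenshot to each failed test
    video: 'retain-on-failure',    // keep video only when a test fails
    trace: 'retain-on-failure',    // optional: full trace for debugging
  },
  reporter: [['html', { open: 'never' }]], // generate an HTML report per run
});
```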

You: "Create a handful of smoke tests. Try to hit the major flows first"

Etc.

That's the loop. You test every output. Agent writes, you validate. Repeat.

Important Note: LLM context windows degrade as conversations grow. Start a new chat every 10–15 interactions to maintain quality, save progress frequently (commit working code), and work on one issue at a time. Occasionally the agent will corrupt the entire framework and you'll need to undo all recent changes to get back on track, which is painful if you had a lot of uncommitted, working code.

Part III: Adding Test Coverage with AI Agents

Once your framework is set up, adding test coverage becomes remarkably fast. But how you work with the agent determines whether you build a reliable test suite or a collection of false positives.

Two Valid Approaches to Adding Tests

Approach 1: Dictated Tests ("Build a test that does X then Y then asserts for Z")

What happens when you prompt:

> "Create a test that verifies users cannot submit expenses over $10,000 without VP approval. The form should show validation error 'VP approval required for amounts exceeding $10,000' and disable the submit button."

How the agent responds:

The agent will ask for missing context or make assumptions:

  • "What locator should I use for the amount field?"
  • "Where does the error message appear?"
  • "What's the data-testid for the submit button?"

Or it will generate code using defensive locator strategies from your copilot-instructions.md:

[When(@"I enter an expense amount of ""(.*)""")]

public async Task EnterExpenseAmount(string amount)

{

var amountField = Page.GetByLabel("Amount");

await Expect(amountField).ToBeVisibleAsync();

await amountField.FillAsync(amount);

}

[Then(@"I should see validation error ""(.*)""")]

public async Task ShouldSeeValidationError(string errorText)

{

var errorMsg = Page.Locator(".validation-error");

await Expect(errorMsg).ToBeVisibleAsync();

await Expect(errorMsg).ToContainTextAsync(errorText);

}

[Then(@"the submit button should be disabled")]

public async Task SubmitButtonShouldBeDisabled()

{

var submitBtn = Page.GetByRole(AriaRole.Button, new() { Name = "Submit" });

await Expect(submitBtn).ToBeDisabledAsync();

}

Why SDET expertise matters here:

  • Assertion design: You specified THREE validations (error message appears, submit button disabled, specific error text)
  • Edge case thinking: You thought of the $10,000 threshold test — the agent wouldn't suggest this unprompted
  • Negative testing: You'll verify this test fails correctly if validation is broken

The specificity spectrum:

*Vague prompt (agent will struggle):*

> "Test the expense form"

*Better prompt:*

> "Test that users can submit expenses between $1-$9,999"

*Ideal prompt:*

> "Test that users can submit expenses between $1-$9,999. Verify: 1) Form accepts amount, 2) Submit button stays enabled, 3) Confirmation shows 'Submitted for manager approval' after submit, 4) New expense appears in 'Pending' list with correct amount."

The more specific your requirements, the better the agent's implementation.

Approach 2: Agent-Generated Tests in Bulk ("Create tests for me")

The shotgun approach works well when you want to generate multiple test scenarios at once, though each test still requires independent validation. There are two ways to do this:

Option A: Public applications (Gmail, GitHub, Amazon, etc.)

What happens when you prompt:

> "Create smoke tests for Gmail"

The agent leverages training data about common user flows:

  • Login/authentication flows
  • Core features (compose email, search, settings)
  • Common user journeys (create → edit → delete patterns)

Example output you'll receive:

```gherkin
Scenario: User can compose and send email
  Given I am logged into Gmail
  When I click "Compose"
  And I enter "test@example.com" in the To field
  And I enter "Test Subject" in subject
  And I enter "Test message" in body
  And I click "Send"
  Then I should see "Message sent" confirmation
```

The agent will generate Page Objects, step definitions, and assertions based on typical Gmail patterns — all without seeing your specific instance.

Option B: Custom applications (BETTER approach)

Instead of asking the agent to guess, provide business requirements directly:

Your prompt:

> "Build tests to validate the following business requirements for our expense reporting app:

> 1. Users must authenticate before submitting expenses

> 2. Expenses under $10,000 require manager approval only

> 3. Expenses $10,000+ require VP approval

> 4. Receipt upload is mandatory for expenses over $75

> 5. Approved expenses appear in 'Pending Payment' queue

> 6. Rejected expenses return to user with rejection reason

> Submit button locator: id='submit-expense'

> Approval status locator: data-testid='approval-status'"

Why this is better:

The agent CANNOT see your application's DOM or structure. When you provide requirements:

  • Agent builds tests matching YOUR exact specifications (not generic patterns)
  • Tests validate your business logic (not guesses about common flows)
  • Assertions target your specific success criteria
  • Locators use the identifiers you provide

Without requirements, the agent will:

1. Ask clarifying questions: "What are the critical user flows?"

2. Make educated guesses based on common patterns

3. Generate generic templates you need to customize

Why SDET expertise matters for bulk generation:

  • Risk identification: You know which flows matter (login is critical, changing theme color isn't)
  • Requirements translation: You convert business rules into testable acceptance criteria
  • Validation strategy: You verify the agent's assertions actually catch failures
  • Locator guidance: You provide stable selectors or work with devs to add them

The validation loop (mandatory for ALL generated tests):

1. Agent generates smoke test for "Submit Expense"

2. You run it → ✅ Passes

3. You intentionally break it (change expected confirmation text)

4. You run it again → ❌ MUST fail with clear error

5. If step 4 passes incorrectly → Assertion is weak, reject and fix

6. Only after confirming test fails correctly → Accept it
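
Here's what step 3 looks like in practice. This is a hypothetical smoke test; the URL, locators, and confirmation copy are assumptions:

```typescript
import { test, expect } from '@playwright/test';

test('submitted expense shows confirmation', async ({ page }) => {
  await page.goto('https://example.com/expenses/new');
  await page.getByLabel('Amount').fill('42.00');
  await page.getByRole('button', { name: 'Submit' }).click();

  // Step 3 of the loop: temporarily change this expected text
  // (e.g. to 'Submitted for manager approvalXX') and re-run.
  // If the test still passes, the assertion is weak: reject and fix it.
  await expect(page.getByTestId('confirmation')).toHaveText(
    'Submitted for manager approval'
  );
});
```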

Additional Notes

Should we mix UI tests and API/database tests in the same framework?

Yes! API and database tests run much faster than UI tests and are more stable over time. The majority of your test cases should therefore target the non-UI layers, with only a small minority driving the UI directly. The most effective CI/CD test plan covers all endpoints and database tables in every run, plus a small selection of end-to-end UI tests, for maximum confidence in the minimum possible time.

You don't want 20 UI tests that follow the exact same steps with slightly different data permutations; running them all takes a lot of time for relatively little value. If your API tests hit all of those edge and corner cases instead, you get the same value without the runtime cost.
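
For illustration, an API-layer check can live right alongside the UI tests in the same Playwright project. A minimal sketch, assuming a hypothetical /api/expenses endpoint and a baseURL set in playwright.config.ts:

```typescript
import { test, expect } from '@playwright/test';

// Covers the $10,000 threshold at the API layer: fast, stable, no browser.
test('expense over $10,000 is rejected without VP approval', async ({ request }) => {
  const response = await request.post('/api/expenses', {
    data: { amount: 10001, approver: 'manager' },
  });
  expect(response.status()).toBe(422); // assumed validation status code
});
```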

Is BDD important?

Yes, I recommend implementing BDD (Given/When/Then) from the beginning. Here's why:

Benefits of BDD structure:

  • Organization: Test scenarios are clearly structured with setup (Given), actions (When), and validations (Then)
  • Readability: Non-technical stakeholders can understand what's being tested
  • Maintainability: Step definitions are reusable across multiple scenarios
  • Consistency: Forces a standard pattern across all tests

The refactoring cost:

If you start without BDD and decide to add it later, every single test needs to be reworked:

  • Procedural test code → Gherkin scenarios
  • Inline assertions → Step definitions
  • Page interactions → Reusable step methods
  • Test data → Scenario parameters
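
To make that mapping concrete, here's a minimal before/after using the expense example from earlier (the test name and locators are illustrative; note the Then step matches the SpecFlow bindings shown in Part III):

```typescript
import { test, expect } from '@playwright/test';

// Before: procedural test with inline assertions
test('expense over $10,000 requires VP approval', async ({ page }) => {
  await page.goto('https://example.com/expenses/new');
  await page.getByLabel('Amount').fill('10001');
  await expect(page.locator('.validation-error'))
    .toContainText('VP approval required for amounts exceeding $10,000');
});
```

```gherkin
# After: the same check as a Gherkin scenario backed by reusable step definitions
Scenario: Expense over $10,000 requires VP approval
  When I enter an expense amount of "10001"
  Then I should see validation error "VP approval required for amounts exceeding $10,000"
```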

This refactoring is painful, time-consuming, and error-prone. Starting with BDD avoids this technical debt entirely.

When to skip BDD:

Only skip BDD if:

  • You're prototyping and plan to throw away the code
  • Your team explicitly doesn't want business-readable tests
  • You're building a small utility with <10 tests and have no need to expand it later

For production frameworks meant to scale, BDD pays dividends from day one.

Conclusion: The Agentic Future

The shift to agentic test automation doesn't reduce the need for SDET expertise — it makes expertise the only thing standing between success and catastrophe.

AI agents can implement patterns and code faster than any human, but they cannot reliably:

  • Choose which tests matter
  • Design robust assertion strategies
  • Interpret complex failures
  • Predict maintenance burden
  • Recognize when to stop adding and start refactoring
  • Maintain architectural consistency across rapid iteration
  • Validate their own correctness
  • Prevent false positives

These require human judgment, experience, and domain knowledge.

The future isn't "AI replacing SDETs" — it's "SDET expertise becoming mandatory gatekeeping for AI-generated automation."

Without SDETs:

  • Risk #1: False confidence (tests pass when app is broken, locators select wrong elements silently)
  • Risk #2: Framework collapse (unmaintainable within 6 months)
  • Result: Automation theater at unprecedented scale

With expert SDETs:

  • Risk #1 mitigated (not eliminated): Rigorous assertion validation, defensive locator strategies, negative testing, constant vigilance
  • Risk #2 mitigated: Architectural discipline, periodic audits, pattern enforcement
  • Result: 5–10x productivity gains — but eternal vigilance required

Organizations that understand this will see extraordinary ROI. Organizations that don't will build disasters faster than ever before.

The choice is yours.

References

[1] Zhong, Z., Raghunathan, A., & Carlini, N. (2025). ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases. arXiv preprint arXiv:2510.20270. https://arxiv.org/abs/2510.20270

[2] Pathak, K. (2025). Modern Test Automation With AI (LLM) and Playwright MCP. DZone. https://dzone.com/articles/modern-test-automation-ai-llm-playwright-mcp

[3] Anthropic. (2024). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol

[4] GitHub. (2024). Using Model Context Protocol (MCP) with GitHub Copilot. https://docs.github.com/en/copilot/using-github-copilot/using-mcp-with-github-copilot

[5] Microsoft. (2024). Playwright for .NET. https://playwright.dev/dotnet/

[6] Fowler, M. (2013). PageObject. Martin Fowler's Bliki. https://martinfowler.com/bliki/PageObject.html

[7] GitHub. (2024). GitHub Copilot Documentation. https://docs.github.com/en/copilot

About the author: Andy Weekes is Senior Consultant II with Neudesic (an IBM Company). Connect on LinkedIn: https://www.linkedin.com/in/andy-weekes-361047a7/

What's your experience with AI-assisted test automation? Share in the comments.