Most people send sensitive prompts to AI tools without knowing where that data ends up. Here's what really happens, why the rules differ wildly by tier, and what to pay attention to.

You paste a contract into ChatGPT. You ask Claude to review a sensitive email. You hand Gemini your client roster to draft a follow-up. And you (mostly) hope for the best.

In this article, I want to discuss what actually happens to that data once it leaves your machine. Where it lives, how long it stays, who reads it, and what ends up training the next model version. The answers are VERY different depending on the provider, the tier, and the configuration.


Introduction

The "AI privacy" conversation is a mess. Marketing copy says one thing. Terms of service say another. Default settings say a third. And every six months, a policy update quietly flips defaults on you while nobody is watching.

I've watched smart people (myself included) paste things into AI tools they would NEVER paste into a public Slack. Not because they don't care about privacy. But because they don't know what happens next. And the providers don't make it easy.

Let's clear the fog.

TL;DR

Most "we don't train on your data" claims are technically true and practically misleading. Here's the short version:

  • Every prompt can take FOUR distinct paths: inference, logging, training, and human review. "No training" only addresses ONE of them.
  • The same brand has OPPOSITE defaults across tiers. ChatGPT Plus trains by default; the OpenAI API does not. Same company, different worlds.
  • "Free" almost always means "trained on". If you're not paying with money, you're paying with data.
  • Opt-out toggles reduce the probability of training; they don't guarantee zero training. Only contractual agreements do that.
  • The real privacy lever is your tier (anonymous, free, paid consumer, API, enterprise), NOT the brand on the box.
  • Local inference (Ollama, vLLM, llama.cpp) is the only zero-trust option. Nothing leaves your device.

Now let me explain why this matters and what to actually pay attention to.

The Four Data Paths

When you send a prompt to a hosted AI provider, that prompt can take up to four paths simultaneously. This is the foundation. Get this wrong and the rest doesn't make sense.

  1. Inference (mandatory). The prompt has to reach the model and produce a response. Without this, there's no service. No way around it.
  2. Logging (almost always present). Providers store prompts and responses (at least transiently) for debugging, billing, abuse monitoring, and incident response. Retention windows range from "no logs at all" (rare) to "30 days" (common) to "indefinite" (more common than you'd hope).
  3. Training and model improvement. Your prompt and the model's response get added to a dataset that future models train on. Once it's baked into a checkpoint, it's effectively unrecoverable; the model has been mathematically influenced by your data.
  4. Human review. Trust and safety teams (or contractors) read prompts and responses for quality evaluation, labeling, abuse triage, and moderation. Sometimes this overlaps with training; sometimes it's separate.

Here's the crucial point: a "we don't train on your data" claim usually means path #3 is off. It does NOT mean paths #2 and #4 are off. Reading carefully matters.

You might wonder: how big a deal is this? Big. A provider that doesn't train but retains your prompts indefinitely, with possible human review, is not as private as one that retains for 30 days then deletes everything. Both can claim "we don't train". Only one is actually private.

The Five-Tier Ladder

The same model from the same provider has radically different defaults depending on which product surface and which subscription tier you're on. From most permissive to most restricted:

  1. Anonymous / unauthenticated demo. Public chat, no login. The provider trains on you freely. The legal basis is weak; the data is high quality. Examples: Hugging Face Spaces demos, public Le Chat sessions before login.
  2. Authenticated free tier. Logged-in, free plan. Provider almost always trains on prompts by default. Opt-out toggle exists, somewhere. Examples: ChatGPT Free, Claude.ai Free, Gemini Free, Le Chat Free.
  3. Consumer paid tier. ChatGPT Plus, Claude Pro, Gemini AI Pro, Grok Premium. The default varies by provider; this is where it gets confusing. Some still train by default with an opt-out; others don't. Pre-2025 advice is often wrong here because policies keep flipping.
  4. Developer API. api.openai.com, api.anthropic.com, etc. The default since OpenAI's March 2023 policy change is no training on inputs/outputs. Logs are still kept for abuse monitoring (typically 30 days). Zero-retention is sometimes available on request.
  5. Enterprise / Business plan with DPA. ChatGPT Enterprise, Claude Enterprise, Gemini for Workspace. Adds: signed Data Processing Agreement, contractual no-training, customer-managed encryption keys, SOC 2, data residency, sometimes BAA for HIPAA. Free trials of these tiers do NOT include the DPA.

The most important thing to understand: the brand is mostly noise; the tier is signal. "OpenAI's privacy" doesn't mean anything. "ChatGPT Plus's privacy" and "OpenAI API's privacy" are two completely different conversations.

Why Defaults Flip Across Tiers

The economics flip at each level, and the defaults follow the economics.

  • Anonymous and free users are paid in data. Their prompts are valuable training material; their privacy is the price.
  • Paid consumer users have begun paying in money. Providers can afford some friction (the opt-out toggle).
  • API customers usually build products for OTHER users. If the API trained on customer data, those products' end users would be exposed without consent. The legal liability would be enormous; the default flipped.
  • Enterprise customers are often regulated (healthcare, finance, government). They won't deploy without DPA, encryption, audit. Providers building this tier are selling compliance, not just a model.

Practical implication: moving up the ladder is NOT free. Each tier above #2 typically multiplies your cost per token by 2–10x. The privacy guarantees are real, but they are sold.

The Patterns You Need to Know

Across most providers (OpenAI, Anthropic, Google, xAI, Mistral, Perplexity), three patterns hold remarkably well as of early 2026:

Pattern 1: "Consumer trains, API doesn't"

This is the most consistent rule in the industry. ChatGPT Plus trains on your data by default. The OpenAI API does not. Claude.ai (any consumer tier) trains on your data by default since Sep 2025. The Anthropic API does not. Same company, opposite defaults.

The exceptions are worth knowing:

  • Cohere is the Western API outlier; its production API trains BY DEFAULT. Toggle Data Controls off, sign Enterprise/ZDR, or use Private Deployment.
  • Gemini API free tier trains by default. Upgrade to paid (Cloud Billing account) for no-train.
  • xAI API used under free credits: whether training applies depends on the specific promo's terms.

Pattern 2: "Free" almost always means "trained on"

For aggregators like OpenRouter and HuggingFace Inference, "free" models are typically free because the upstream provider trains on them. OpenRouter exposes this via a per-model data_policy.training flag, which is the right pattern for aggregators to follow.

The economic logic is consistent: someone has to pay for compute. If you're not paying, you're paying with your data. That's just a fact.
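If you want to check this programmatically rather than trust the marketing page, here's a minimal sketch that pulls OpenRouter's public model catalog and filters on the data_policy.training flag mentioned above. The exact field layout is an assumption based on how OpenRouter describes it; inspect a live response before relying on it.

```python
import requests

# Fetch OpenRouter's public model catalog (no API key needed for this endpoint).
resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
resp.raise_for_status()
models = resp.json()["data"]

# Keep only models whose listed policy does NOT allow training on prompts.
# ASSUMPTION: the data_policy.training field shape; when it's absent,
# conservatively assume the upstream provider trains.
no_train = [
    m["id"]
    for m in models
    if not m.get("data_policy", {}).get("training", True)
]

print(f"{len(no_train)}/{len(models)} models advertise a no-training policy")
```

The conservative default matters: absence of a flag is not a guarantee, so the sketch treats "unknown" as "trains".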

Pattern 3: Enterprise tiers add four things, in order

  1. Contractual no-training (most providers)
  2. Data Processing Agreement (DPA, required for GDPR work)
  3. SOC 2 / ISO 27001 / HIPAA BAA (compliance attestations)
  4. Zero Data Retention (the strongest commitment; not always available)

A claim of "enterprise-grade privacy" without a signed DPA, a current SOC 2 Type II report, and a written ZDR option is marketing, not compliance. Don't get fooled by the badge wall.

Provider Quick Reference

You don't need to memorize every provider's policy. But you should know enough to spot the gotchas. Here's the short list:

  • OpenAI. ChatGPT Plus/Pro train by default (toggle: Settings → Data Controls). API: no training, 30-day abuse logs by default. Caveat: NYT preservation order (May-Sep 2025) held consumer logs longer than policy.
  • Anthropic. Since Sep 28, 2025, Claude.ai Free/Pro/Max train by default (opt-in retention up to 5 years). API: no training. Content flagged by Trust & Safety classifiers can be retained for up to 2 years, even with ZDR.
  • Google Gemini. Consumer trains by default (toggle at myactivity.google.com). Vertex AI: no training, but the Vertex GLOBAL endpoint silently breaks data residency. Use REGIONAL endpoints for compliance (see the sketch after this list).
  • xAI Grok. Trains on X by default. "Private Chat" mode (ghost icon) never trains, 30-day hard delete. Check your X-side toggle; some unauthenticated visitors outside EU/UK can't opt out at all.
  • Mistral. Le Chat trains by default; toggle in Admin Console. La Plateforme Scale: no training. Strongest EU sovereignty story among frontier providers.
  • Cohere. Production API trains by DEFAULT. The outlier. Toggle Data Controls off or sign Enterprise/ZDR.
  • Perplexity. Consumer Pro trains by default; the API does NOT. OPPOSITE defaults under one brand. Toggle in Settings → AI Data Retention.
  • Cursor. Privacy Mode forced ON for Business/Enterprise. Sometimes provides STRONGER guarantees than calling the upstream API directly (contractual zero-retention with Anthropic, xAI, OpenAI, Vertex).
  • Azure OpenAI. Stores prompts for 30 days BY DEFAULT for abuse monitoring. The Modified Abuse Monitoring (MAM) form opts out, but eligibility is restrictive. For regulated data on Azure, MAM is essentially mandatory.
  • AWS Bedrock. Structurally prevents Anthropic, Cohere, Mistral, etc. from accessing prompts. Often STRONGER separation than calling those providers' direct APIs.
  • Hyperscalers vs direct APIs. The trade-off: hyperscalers give stronger structural separation, but new models often appear weeks earlier on the providers' direct APIs.
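To make the Vertex residency point concrete: with the google-cloud-aiplatform SDK, the region you pass at init time determines which endpoint your prompts hit. A minimal sketch, assuming a GCP project with Vertex AI enabled (the project ID is hypothetical, and the model name may differ by the time you read this):

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Pin processing to a specific region. Relying on a global endpoint routes
# requests wherever capacity exists, which can silently break the
# data-residency commitments your DPA depends on.
vertexai.init(project="my-gcp-project", location="europe-west4")  # hypothetical project ID

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content("Summarize this internal memo: ...")
print(response.text)
```

One line of configuration is the difference between "EU-resident processing" and "wherever there was spare capacity that morning".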

For the full per-provider matrix (50+ providers), see my dedicated LLM wiki: AI Wiki — AI Providers — Index

The Opt-Out Reality Check

Opt-out toggles reduce the probability of training on your data; they do NOT guarantee zero training. Here's why:

  • UI relocation. Toggles move between product redesigns. The setting that disabled training in February may be in a different panel by August.
  • Policy changes. Some providers explicitly state that prior opt-outs may not apply to new product features.
  • Default reset on plan change. Switching from Plus to Team to Enterprise can reset toggles to tier-default values.
  • Family / team account inheritance. Seats in a team plan may inherit the team owner's opt-out state, not yours.
  • Geographic variance. EU and UK users often see different defaults than US users due to GDPR.
  • Feedback exception. A thumbs-up/down click can authorize retention beyond the toggle (Anthropic explicitly does this for 5 years).

The honest framing: an opt-out toggle is a probability lever, not a guarantee. For a hard guarantee, you need the contractual mechanism (DPA). Period.

What To Actually Do

Step 1: Identify your tier BEFORE you send the prompt

Are you on a free tier? A paid consumer tier? An API? An enterprise plan with a DPA? The answer determines your default; the brand is secondary.

Step 2: Match the tier to the use case

  • Casual exploration → Consumer paid + opt-out
  • Building a product → Direct API (paid)
  • Regulated commercial (HIPAA, finance) → Hyperscaler with DPA + BAA
  • EU sovereignty required → EU sovereign (OVHcloud, Scaleway, IONOS)
  • Maximum privacy, sensitive personal data → Local (Ollama, vLLM, llama.cpp)
  • AI coding → Cursor with Privacy Mode

Step 3: For sensitive prompts, run them locally

If you're dealing with health data, client data, contracts, anything you wouldn't paste in a public Slack: run a local model. Ollama, LM Studio, llama.cpp, vLLM. Inputs and outputs never leave your machine. It's the only zero-trust option.
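To make "local" concrete: here's a minimal sketch using Ollama's local HTTP API, which listens on localhost:11434 by default once the daemon is running and you've pulled a model (e.g. `ollama pull llama3.2`):

```python
import requests

# Everything below talks to a model running on YOUR machine.
# Prerequisites: Ollama installed, daemon running, model pulled.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [
            {"role": "user", "content": "Review this contract clause: ..."}
        ],
        "stream": False,  # return one JSON response instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

The request never leaves localhost. No tier, no toggle, no DPA required.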

Step 4: For aggregators, audit per model

OpenRouter, Hugging Face, Bedrock, Vertex Model Garden: the operative training/retention policy is PER-MODEL, not per-aggregator. Mixing free and paid models in one app produces inconsistent privacy guarantees per request. That's a compliance trap waiting to happen.
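One mitigation OpenRouter itself offers: provider routing preferences let you exclude data-collecting providers per request. A minimal sketch; the provider.data_collection preference comes from OpenRouter's provider-routing documentation, and the model slug here is hypothetical, so verify both against current docs:

```python
import os
import requests

# Ask OpenRouter to route ONLY to providers that don't store/train on prompts.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "mistralai/mistral-large",  # hypothetical model slug
        "messages": [{"role": "user", "content": "Draft a follow-up email."}],
        # Verify this preference key against OpenRouter's current docs.
        "provider": {"data_collection": "deny"},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Setting this per request, rather than per app, is what keeps mixed free/paid model usage from becoming the compliance trap described above.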

Step 5: For enterprise, demand the DPA, not the Trust Center

A Trust Center page is a marketing surface. A signed DPA is a contract. Don't confuse the two. If you're handling PHI, confirm the BAA is signed BEFORE you send anything.

Step 6: Re-read policies every quarter

Privacy policies change. The dates in this article reflect what was true on April 28, 2026. Anthropic flipped consumer Claude defaults in Sep 2025; Cursor flipped Privacy Mode default in mid-2024. A green-tier evaluation from 18 months ago is not authoritative. Assume drift.

Personal Privacy Hygiene

Tier choice is the structural lever. But your own behavior matters just as much. Pick the right tier and still leak data through bad habits, and you've solved nothing. Here's the personal practice layer.

Calibrate your sensitivity

Different data deserves different treatment. A simple classification I use:

  • Public: blog drafts, public docs, marketing copy. Any tier works.
  • Internal: meeting notes, internal docs, project plans. Avoid free tiers; use API or paid + opt-out.
  • Confidential: contracts, client work, salary data, strategy. Enterprise tier with DPA, OR local.
  • Regulated: PHI, large-scale PII, financial records, kids' data. BAA-signed enterprise OR local. Nothing else.

The simple test: would I paste this in a public Slack? If no, don't paste it in a free-tier chatbot either. Same risk surface, different shape.
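If you want this classification to be enforceable rather than aspirational, encode it. A minimal sketch of a pre-flight check; the class and tier names mirror the lists above, but everything else is a naming choice of mine, not an established API:

```python
from enum import Enum

class DataClass(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    REGULATED = 4

# Which surfaces are acceptable per data class (mirrors the list above).
ALLOWED_SURFACES = {
    DataClass.PUBLIC: {"free", "paid_consumer", "api", "enterprise_dpa", "local"},
    DataClass.INTERNAL: {"api", "paid_consumer_opt_out", "enterprise_dpa", "local"},
    DataClass.CONFIDENTIAL: {"enterprise_dpa", "local"},
    DataClass.REGULATED: {"enterprise_baa", "local"},
}

def check(data_class: DataClass, surface: str) -> None:
    """Raise BEFORE the prompt is sent, not after it has leaked."""
    if surface not in ALLOWED_SURFACES[data_class]:
        raise PermissionError(
            f"{surface!r} is not approved for {data_class.name} data"
        )

check(DataClass.CONFIDENTIAL, "local")  # passes silently
try:
    check(DataClass.REGULATED, "paid_consumer")
except PermissionError as e:
    print(e)  # 'paid_consumer' is not approved for REGULATED data
```

Ten lines of policy-as-code beats a wiki page nobody reads.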

Default behaviors I recommend

These are non-negotiable in my own practice. Treat them as defaults; deviate consciously, never by accident.

  • NEVER paste secrets. Passwords, API keys, tokens, .env contents, private SSH keys. Treat the chat box like the public internet.
  • Anonymize client data BEFORE sending. Replace names, emails, IDs with placeholders. Reverse the substitution after you get the response. Yes, it's annoying. Do it anyway; it's scriptable (see the sketch after this list).
  • Don't share data that isn't yours to share. Client contracts often include confidentiality clauses that AI tools violate by default. You owe your clients this discipline.
  • Strip metadata before uploading files. PDFs, Word docs, images carry author names, GPS coordinates, edit history. The visible content is rarely the whole story.
  • For health, finance, legal: assume the chat is being read. Use local models or BAA-signed enterprise. Nothing in between.
  • Audit your chat history monthly. Delete what you don't need. The shorter your history, the smaller your breach surface.
  • Use Private modes when available. Grok Ghost, ChatGPT Temporary Chat, Claude.ai's "don't save". They're not perfect, but they're better than nothing.
  • Separate work and personal accounts. Don't let an employer's audit of your work account expose your personal life (and vice versa).
  • Disable AI features in productivity tools (Google Docs Gemini, Slack AI, Notion AI) until you've checked that surface's policy. Bolted-on AI usually inherits the host product's defaults, which are rarely what you want.
  • Treat shared chat links as PUBLIC. Once a shareable link exists, search engines can find it. There's a long history of leaked ChatGPT shared links to prove it.
  • For sensitive code: local LLM or self-hosted only. A leaked API key in a prompt is in the breach surface forever.
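The anonymization step above is mechanical enough to script. A minimal sketch of placeholder substitution with a reversible mapping; the names and patterns are illustrative, and real pseudonymization needs a reviewed entity list, not just a regex:

```python
import re

def pseudonymize(text: str, names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace known client names with placeholders; return the mapping
    so the substitution can be reversed on the model's response."""
    mapping: dict[str, str] = {}
    for i, name in enumerate(names, start=1):
        placeholder = f"CLIENT_{i}"
        mapping[placeholder] = name
        text = re.sub(re.escape(name), placeholder, text)
    # Also mask anything that looks like an email (NOT reversible in this sketch).
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "EMAIL_REDACTED", text)
    return text, mapping

def reverse(text: str, mapping: dict[str, str]) -> str:
    """Restore the original names in the model's response."""
    for placeholder, name in mapping.items():
        text = text.replace(placeholder, name)
    return text

safe, mapping = pseudonymize(
    "Email Jane Doe (jane@acme.com) about the renewal.", ["Jane Doe"]
)
print(safe)  # -> "Email CLIENT_1 (EMAIL_REDACTED) about the renewal."
# ... send `safe` to the model, then: reverse(response_text, mapping)
```

The provider sees placeholders; you see names. The mapping never leaves your machine.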

Build your personal threat model

Before pasting anything sensitive, three questions:

  1. What's the worst case if this leaks? A draft tweet leaking is a shrug. A client list leaking is a lawsuit. Match the precaution to the consequence.
  2. Whose data is this? Yours alone, or someone else's? Other people's data raises the bar; consent matters and you usually don't have it for a chatbot.
  3. Is this regulated? HIPAA, GDPR, SOX, FERPA, COPPA. If yes, the playbook is non-negotiable: BAA-signed enterprise or local. No exceptions.

If you can't answer all three before pasting, you're not ready to paste yet. Slow down. Reclassify. Choose the right surface.

A note on team policies

If you manage a team, the personal hygiene layer becomes a policy layer. Three rules I'd recommend:

  • Tier requirements per data class. No Plus accounts for client work; no free tiers for anything internal. Make it explicit.
  • Account separation enforced by SSO. Personal accounts shouldn't be able to handle company data; corporate accounts shouldn't be used for personal exploration.
  • Centralized DPA inventory. Whoever signs DPAs in your company keeps a list of which providers have one and what tier they cover. When in doubt, default to "no, not yet" until the contract exists.

This isn't paranoia; it's just professional practice. The people who get burned are the ones who skip these.

Conclusion

AI privacy is not a binary. It's a four-path, five-tier matrix that varies by brand, by promo, by quarter. The marketing simplifies it; the reality doesn't.

The single most leveraged thing you can do is shift your mental model from "is this provider safe?" to "which tier am I on, and what's that tier's default?" The brand is mostly noise. The tier is signal.

And for the data you really care about: run it locally. Or sign a DPA. There's no third option that holds up under scrutiny.

📬 Subscribe to my newsletter for weekly tips on AI, Knowledge Management, and (Zen) Productivity.

That's it for today! ✨



About Sébastien

I'm Sébastien Dubois, and I'm on a mission to help knowledge workers escape information overload. After 20+ years in IT and seeing too many brilliant minds drowning in digital chaos, I've decided to help people build systems that actually work. Through the Knowii Community, my courses, products & services and my Website/Newsletter, I share practical and battle-tested systems.

I write about Knowledge Work, Personal Knowledge Management, Note-taking, Lifelong Learning, Personal Organization, Productivity, and more. I also craft lovely digital products and tools.

If you want to follow my work, then become a member and join our community.


Originally published at https://www.dsebastien.net on April 28, 2026.