AI Tarpits vs AI Citations: Defend Without Going Dark

AI Tarpits: Trap the Bots, But Don’t Bury Your Own Content

An AI tarpit is a web server technique that identifies misbehaving or aggressive crawlers and responds by serving them an endless labyrinth of synthetic, low-value pages. The goal is to consume their bandwidth and dilute the quality of any data they collect for training.

This approach exists in a gray area of the modern web. Some site owners use tarpits selectively against clearly abusive bots, while others avoid them entirely. A blanket “block everything with AI in the name” policy can backfire: the same rules that stop training crawlers like GPTBot can also block the retrieval/indexing bots that surface your content in ChatGPT, Perplexity, Google AI Overviews, and similar systems.

In 2026 the practical reality is nuanced. The relevant question isn’t simply “Should I deploy an AI tarpit?” but rather how to distinguish between different kinds of AI-related crawlers and decide which ones (if any) warrant defensive measures—without accidentally harming the visibility of your own site. Awareness of these distinctions helps operators make informed choices rather than reflexive ones.

TL;DR

Tarpits trap unwanted AI crawlers in infinite mazes of synthetic content. Nepenthes, Iocaine, and Quixotic poison the corpus. Anubis blocks with proof-of-work. Cloudflare AI Labyrinth ships a hosted version on every plan.
The famous 94 percent bot-kill stat belongs to Iocaine on Gergely Nagy's personal site, not Nepenthes and not the industry. Useful as a vibe, not a benchmark.
Cloudflare AI Labyrinth does not use Markov gibberish. It uses Workers AI to generate real, scientifically-accurate decoy pages, pre-cached in R2, deliberately not misinformation.
Indiscriminate blocking has a measurable cost. A Rutgers and Wharton study of 30 major newspapers found a 23.1 percent monthly traffic decline after AI blocking, later revised to roughly 7 percent on a weekly window.
The strategic move is per-user-agent policy: tarpit training scrapers (GPTBot, ClaudeBot, Meta-ExternalAgent), allow retrieval bots (OAI-SearchBot, PerplexityBot, ChatGPT-User).

What an AI tarpit actually does

Mechanically, a tarpit is three steps. First, it spots a suspect visitor, usually by user-agent string, by an unlinked honeypot path that only a non-rendering crawler would hit, or by a hidden link styled invisible to humans. Second, it serves a page of plausible-looking text. Third, every page links to more pages, each just unique enough to look novel, with no exit and no end. The crawler thinks it has struck content gold. What it has actually struck is a renderable hamster wheel engineered to look productive while teaching its model approximately nothing.
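The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Nepenthes or any other named tool is implemented: the suspect user-agent list, the /trap/ honeypot prefix, and every function name here are hypothetical.

```python
import hashlib

# Hypothetical values, chosen only for this example.
SUSPECT_AGENTS = ("GPTBot", "CCBot")
TRAP_PREFIX = "/trap/"


def is_suspect(user_agent: str, path: str) -> bool:
    """Step 1: flag by user-agent substring or by a hit on the unlinked honeypot path."""
    ua = user_agent.lower()
    return path.startswith(TRAP_PREFIX) or any(a.lower() in ua for a in SUSPECT_AGENTS)


def maze_links(path: str, fanout: int = 3) -> list[str]:
    """Step 3: derive child links deterministically from the current path.

    Hashing the path means each URL always yields the same "new" links,
    so the maze needs no storage and never runs out of pages.
    """
    links = []
    for i in range(fanout):
        digest = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        links.append(f"{TRAP_PREFIX}{digest}")
    return links


def maze_page(path: str) -> str:
    """Steps 2 and 3: a plausible-looking page whose only exits lead deeper."""
    anchors = "".join(f'<a href="{href}">Related article</a>\n' for href in maze_links(path))
    return f"<html><body><p>Filler text for {path}</p>\n{anchors}</body></html>"
```

Because every link is computed from the path that led to it, the server does almost no work per page while the crawler keeps discovering "fresh" URLs.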

Definition: AI tarpit

A server-side defense that intentionally serves automated visitors an infinite chain of synthetic pages, with the dual goals of wasting the bot's compute and (optionally) corrupting any downstream training set.

Why web admins and devs build tarpits and nonsense for AI

The catalyst was simple exhaustion. The anonymous developer behind Nepenthes, dubbed "Aaron" by Ars Technica, told reporters his site was getting roughly 30 million hits a month from Facebook's crawler alone. Robots.txt was being ignored. Rate limits were being shrugged off. The feeling that someone else was eating the bandwidth bill while building a product on top of your work was getting harder to swallow. Aaron shipped Nepenthes in mid-January 2025, the post bounced around Mastodon (with Cory Doctorow lending an amplifying push), and within days a small wave of similar tools followed.

The naming theme is half the joke. Nepenthes is a carnivorous pitcher plant. Iocaine is the deadliest poison from The Princess Bride. Quixotic tilts at AI windmills. Anubis weighs your soul (or in practice, your CPU) before letting you pass. The energy is somewhere between digital protest and infrastructure self-defense, and you can usually tell which side a given operator is on by which tool they pick.

Rather than openly blocking AI crawlers with a digital “Stay Off My Lawn” sign, some websites are adopting more sophisticated countermeasures: intentionally serving misleading, low-value, or computationally expensive content designed to exhaust crawler resources and degrade automated retrieval systems.

Examples of these tactics appear in the sections below. If you rely on web fetch or web search with Anthropic Claude, this article on Web Safe Searching explores practical safeguards and validation layers intended to reduce exposure to hostile, poisoned, or resource-draining web environments.

That energy has a face, or at least a username. Community voices across the web have been echoing the same sentiment in sharper and more personal terms:

"I set a honey trap for AI agents with a novel they heard is about them. Now they're flooding the site and talking in hidden rooms."

Legitimate_Neat_384, Reddit r/ChatGPT

"Every day I wake up, pour coffee, and open my stats — and I can almost hear the AI scrapers munching on my words. It's not flattering. It's theft."

Babar Saad, Plain English / AI

The four flavors, optimized for real-world performance (minimal headaches, maximum effect)

Tarpit-adjacent tools fall into two practical camps: poisoners (waste scraper compute + pollute data) and blockers (make them leave quickly). Both fight the same problem—unwanted AI crawlers ignoring robots.txt—but smart deployment turns them from resource hogs into low-maintenance defenses. The goal is asymmetric cost: cheap for you, expensive for them. Here's how to make them work without shooting yourself in the foot.

The poisoners: Nepenthes, Iocaine, Quixotic

These feed garbage or infinite loops to burn crawler bandwidth and poison datasets. Deploy them selectively behind robots.txt disallow rules or on dedicated trap paths to avoid hurting real users or search engines.

Nepenthes (by Aaron B.): The aggressive headline act. It serves an infinite static maze of Markov-chain nonsense. Crawlers wander until their own logic (depth limits, timeouts) bails them out—sometimes for months.
Headache-free version: Run it on a low-priority subdomain or /trap/ path. Tune generation to be deliberately slow but CPU-light (original design already does this well). Monitor your own server metrics; don't let it become your DoS. Warning: Aggressive use can get you deindexed by Google. Use as a honeypot, not on main content.
Iocaine (by Gergely Nagy): Reverse-proxy wrapper on the tarpit idea. Nagy saw ~94% bot traffic drop on his personal site immediately (mostly AI crawlers). This is the most-quoted (and often misattributed) stat.
Headache-free version: Layer it behind basic rate limiting or UA filtering. It's great for poisoning but combine with blocking so you're not endlessly generating garbage for sophisticated adapters. Keep it hidden from humans via path rules or headers.
Quixotic (by Marcus Butler): The gentler, smarter poisoner. It obfuscates and poisons content without full infinite mazes—often pre-generates static garbage files. Ideal if you want deterrence over destruction.
Headache-free version: Pre-generate poisoned versions of content once (saves CPU). Serve them only to suspicious traffic. This avoids live generation overhead and reduces collateral damage. Best for static sites.
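For a sense of what "Markov-chain nonsense" means in practice, here is a minimal word-level sketch. It is not the generator any of these tools actually ships. The function names and the idea of seeding from a fixed value are assumptions for illustration.

```python
import random
from collections import defaultdict


def build_chain(corpus: str) -> dict[str, list[str]]:
    """Map each word to the list of words observed directly after it."""
    words = corpus.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: dict[str, list[str]], length: int, seed: int) -> str:
    """Walk the chain to emit `length` words of locally plausible nonsense."""
    if not chain:
        raise ValueError("corpus must contain at least two words")
    rng = random.Random(seed)  # fixed seed gives the same text every time
    keys = sorted(chain)
    word = rng.choice(keys)
    out = [word]
    while len(out) < length:
        followers = chain.get(word)
        # Dead end (a word that was never followed by anything): restart.
        word = rng.choice(followers) if followers else rng.choice(keys)
        out.append(word)
    return " ".join(out)
```

Seeding the generator from something stable, such as a hash of the requested URL, would make each trap page reproducible without storing it, which fits the pre-generate-once advice above.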

Pro tip for all poisoners

Don't do it. Take the higher ground. Trying to teach a bot a lesson distracts from your actual goals, and fighting bots emotionally is both ironic and a waste of resources. However victimized you feel, there are better methods.

Use nginx, Apache, or Caddy rules to single out only the bad actors (high request rate, no JS, known bad UAs), which keeps your real traffic fast and clean. Alternatively, put a WAF or CDN in front: Cloudflare, Fastly, Akamai, AWS WAF, or similar. Let infrastructure built for abuse filtering absorb the garbage before it reaches your app.

The real win is when this traffic never reaches your server at all. Use rate limits, WAF rules, caching, challenges, and endpoint hardening.

The blocker: Anubis

Anubis, by Xe Iaso (January 2025), is the standout for most operators. It's a lightweight MITM reverse proxy that issues a Hashcash-style proof-of-work challenge for any visitor with "Mozilla" in the User-Agent—basically everything modern. Real browsers solve it invisibly in JavaScript. Headless and low-effort scrapers usually fail or give up.
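The asymmetry behind a Hashcash-style challenge is easy to see in code. The sketch below is not Anubis's actual protocol, and in Anubis the solving happens in browser JavaScript. Both sides are written in Python here purely for illustration, and the function names and challenge format are assumptions.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: a random string the client must build its answer on."""
    return secrets.token_hex(16)


def _digest(challenge: str, nonce: int) -> str:
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the smallest nonce whose hash starts with
    `difficulty` zero hex digits. Expected work grows 16x per extra digit."""
    prefix = "0" * difficulty
    nonce = 0
    while not _digest(challenge, nonce).startswith(prefix):
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, however long the client had to search."""
    return _digest(challenge, nonce).startswith("0" * difficulty)
```

One visitor pays this cost once and barely notices. A scraper fetching millions of pages pays it over and over, which is the whole point of tuning difficulty low rather than high.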

By mid-2026, Anubis had 200,000+ downloads and adoption by GNOME, FFmpeg, UNESCO, SourceHut, and many Git forges and wikis under bot-induced denial of service pressure. Drew DeVault (SourceHut) put it best:

"Nepenthes has a satisfying sense of justice to it, since it feeds nonsense to the crawlers and poisons their wells, but ultimately Anubis is the solution that worked."

Headache-free deployment:

Deploy as a fronting proxy (not inline on every request if possible).
Tune difficulty low for humans—most modern browsers handle it in under 100ms.
Add exceptions for known good bots (Googlebot, etc.) via allowlists.
Combine with TLS fingerprinting and basic rate limiting for better accuracy.
Monitor false positives—privacy browsers or JS-disabled users can get hit. Offer a bypass link or admin override.
It's written in Go, lightweight, and scales well, but test it under your own load.

This approach shifts cost to the scraper without heavy poisoning overhead on your side.

Making the whole stack perform cleanly

Layer them: robots.txt (honored by good actors) → rate limiting/fingerprinting → Anubis for blocking → selective tarpit/poison for persistent offenders.
Measure everything: Track CPU, bandwidth, false positives. These tools shine when targeted, not blanket.
Accept limits: Sophisticated crawlers adapt (stealth JS, proxies, paid human farms). These buy time and raise costs, but aren't forever walls. For critical content, consider authenticated access, watermarks, or closed APIs.
Ethics/practicality: Poisoning is satisfying but risks polluting caches or running into legal gray areas. Blocking is cleaner for most sites.
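The layering can be expressed as a single decision function. This is a rough sketch, assuming hypothetical thresholds, list contents, and action names. A real deployment would make these decisions in the proxy or WAF, and should verify crawler identity rather than trust the user-agent string, which is trivially spoofed.

```python
from collections import defaultdict, deque

# Hypothetical lists and limits, for illustration only.
DENY_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent")
ALLOW_AGENTS = ("Googlebot", "Bingbot", "OAI-SearchBot", "PerplexityBot", "ChatGPT-User")
MAX_HITS = 30
WINDOW_SECONDS = 10.0

_recent_hits: dict[str, deque] = defaultdict(deque)


def over_rate(client_ip: str, now: float) -> bool:
    """Sliding window: record this hit, drop hits older than the window."""
    hits = _recent_hits[client_ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_HITS


def decide(user_agent: str, client_ip: str, now: float) -> str:
    """Return "tarpit", "challenge", or "serve" for one request."""
    ua = user_agent.lower()
    if any(a.lower() in ua for a in DENY_AGENTS):
        return "tarpit"      # deny-listed bot that showed up anyway
    if any(a.lower() in ua for a in ALLOW_AGENTS):
        return "serve"       # named good bot
    if over_rate(client_ip, now):
        return "challenge"   # unidentified and aggressive: proof-of-work
    return "serve"
```

The ordering is the design choice: named bots are handled by policy first, and the expensive or risky responses are reserved for traffic that is both unidentified and misbehaving.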

Cloudflare changed the math (and corrected the technique)

Figure: Cloudflare AI Labyrinth turns crawler abuse into a controlled maze of off-topic decoy pages instead of letting the bot keep scraping the real site.

On March 19, 2025, Cloudflare shipped AI Labyrinth, available on every plan including Free, as a single toggle in the bot-management dashboard. That moved the tactic from protest tech you self-hosted to a one-click feature on the world's largest CDN. The interesting part is the technical departure: Labyrinth does not use Markov chains. Cloudflare uses Workers AI to pre-generate real LLM-written pages on diverse, scientifically-accurate-but-irrelevant topics, stores them in R2, and serves them when bot activity is detected. Per Cloudflare's announcement, the pages are deliberately not misinformation. They are real factual content unrelated to the crawled site, served as expensive, off-topic noise.

The decoy links are injected into existing pages as hidden HTML invisible to humans. Any visitor who follows them is, with high confidence, a bot, and Cloudflare feeds that signal into its user and entity behavior analytics pipeline to fingerprint the offender across its network.

Pro Tip: Don't mix up Cloudflare's approach with Nepenthes

Plenty of secondary coverage describes AI Labyrinth as "Markov gibberish at the edge." It isn't. It is real generated prose about real topics, just not about your site. The distinction matters if you care about not poisoning the public web on your way to defending one site.

The five tools at a glance

Tool	Approach	Deployment	Philosophy
Nepenthes	Static infinite Markov maze	Self-host, web-server addon	Offensive, poison the corpus
Iocaine	Reverse-proxy infinite garbage maze	Self-host, proxy	Offensive, same family
Quixotic	Lightweight content poisoning	Self-host, lightweight	Offensive-lite, deter not destroy
Anubis	Proof-of-work challenge for any "Mozilla" UA	Self-host, MITM proxy	Defensive, block don't poison
Cloudflare AI Labyrinth	Real LLM decoy pages, edge-served	Hosted, every plan including Free	Defensive plus fingerprinting

The collateral damage problem

Tarpits and broad blocking carry an invoice nobody likes to read out loud. A Rutgers Business School and Wharton study published December 31, 2025, looked at 30 major newspaper sites including CNN, the New York Times, BBC, the Guardian, Fox News, and Daily Mail. Publishers that blocked AI crawlers via robots.txt saw a 23.1 percent decline in monthly visits and a 13.9 percent drop in human-only browsing. A later revision of the same study, using a weekly window instead of a monthly one, put the headline figure closer to 7 percent. The truth is in the measurement window, not the headline. Either way, the direction is clear: indiscriminate blocking shrinks the audience, including the human one, because retrieval-flavored AI traffic and AI-driven discovery are now part of how people find pages.

Figure: Per-bot policy keeps authorized people, search crawlers, and retrieval agents separate from abusive training scrapers.

Most tarpits also cannot tell a training scraper from Googlebot, Bingbot, a price-monitoring script, or a perfectly legitimate research crawler. They detect by behavior, not by motive — closer to anomaly detection than identity verification. A sloppy deployment risks de-indexing on the search side, spam-filter penalties for thin generated content, and bandwidth bills you wrote yourself.

The binary that actually matters

Here is the part most write-ups miss. "AI bots" is not one category. It is at least two, and they want different things:

Training scrapers read your content to bake it into a future model. Once it is in the model, you have no link, no citation, no traffic. Examples: GPTBot, ClaudeBot, CCBot, Meta-ExternalAgent, and Google-Extended (a robots.txt control token for Gemini training rather than a separate crawler).
Retrieval bots fetch or index your content so it can be pulled into retrieval-augmented generation pipelines at query time: a real human asks a question, and your page becomes a cited source in the answer. Examples: OAI-SearchBot, ChatGPT-User, PerplexityBot. Google AI Overviews draw on the regular Googlebot index, so Googlebot access is what keeps you eligible there.

Treating both kinds the same is the whole problem. Block everything and you are protected from training and invisible to citation. Allow everything and you fund a training set that competes with you. The point of the per-user-agent toggle is that you no longer have to pick one bad outcome.

Expert Tip: Build the policy bot-by-bot

In your robots.txt, deny the training crawlers explicitly (User-agent: GPTBot, User-agent: ClaudeBot, User-agent: Meta-ExternalAgent, User-agent: CCBot). Allow the retrieval crawlers explicitly (User-agent: OAI-SearchBot, User-agent: PerplexityBot, User-agent: ChatGPT-User). Keep Googlebot and Bingbot with full access. Then run your tarpit, Anubis, or Cloudflare AI Labyrinth only against deny-listed user-agents and unidentified bot fingerprints. The point is granularity, not a sledgehammer.
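One way to sanity-check a policy like this before publishing it is to run it through Python's standard-library robots.txt parser. The robots.txt text below is a sketch of the tip above, and the test path is a placeholder. Real crawlers may interpret edge cases differently from `urllib.robotparser`, so treat this as a smoke test, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Example policy: deny training crawlers, allow retrieval crawlers,
# leave everyone else (including Googlebot and Bingbot) with full access.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Meta-ExternalAgent
User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def load_policy(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)
for bot in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "Googlebot"):
    verdict = "allowed" if policy.can_fetch(bot, "/any-page") else "denied"
    print(f"{bot}: {verdict}")
```

Remember that robots.txt only states the policy. Enforcement against crawlers that ignore it still has to happen at the edge.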

Why your content needs the citation, not just the click

Figure: AI citation visibility depends on the same discovery paths that broad crawler blocking can accidentally shut off.

The other half of this picture is what the search world calls Generative Engine Optimization (GEO), and what some people still call AEO (answer engine optimization). The idea is to be the source an AI tool quotes when a user asks a question, because in a world where Google AI Overviews already appear in 30 to 40 percent of queries, the click that used to land on your page now lands on Google's summary of your page. If you are not in the summary, you are not in the conversation.

The numbers behind the panic are real. Brandlight's research found that the overlap between top Google links and the sources cited inside AI answers has dropped from about 70 percent to under 20 percent. Conductor's 2026 survey of digital leaders reported that 97 percent see positive impact from GEO efforts, and that GEO now eats roughly 12 percent of the average digital marketing budget, with 32 percent of digital leaders calling it their top 2026 priority. Wikipedia accounts for 47.9 percent of ChatGPT's top citations on factual questions. Reddit accounts for 46.7 percent of Perplexity's top citations. Different model, different favorite watering hole, same underlying reality: source grounding is the new ranking.

If you tarpit the bot that would have cited you, you have just optimized your way out of the answer.

How to set per-user-agent policy without losing your mind

The actual implementation is less exotic than the philosophy. A practical 2026 setup looks like this:

Robots.txt as the polite layer. Explicitly disallow training crawlers. Explicitly allow retrieval crawlers. Use named user-agents, not a single User-agent: * directive that catches Googlebot.
Edge rules as the impolite layer. The polite crawlers will obey robots.txt. The impolite ones will not. That is what Cloudflare AI Labyrinth, Cloudflare's AI bot blocking, or Anubis on your origin are for — your gateway control when politeness goes ignored. Match by user-agent, by ASN where you can, and by behavioral fingerprint where you cannot.
Front-loaded answer-first content. Retrieval bots read the first chunk of a page to decide whether to cite. Put the answer in the first 200 words. Burying the lede now costs traffic from real humans and citations from the AI assistants the humans use.
Monitor what each bot actually does. If GPTBot keeps showing up despite a deny, your log monitoring will catch it — escalate to a hard block at the edge. If OAI-SearchBot shows up after you allow it, your GEO investment is starting to pay off.
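The monitoring step does not need special tooling to get started. Below is a minimal sketch that scans access-log lines for named bots. It assumes the common "combined" log format used by nginx and Apache, and the bot lists and function names are hypothetical choices for this example. Requests for /robots.txt are skipped, since a denied crawler still has to fetch that file to learn it is denied.

```python
import re
from collections import Counter

# Assumes the "combined" access-log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

DENIED = ("GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent")
ALLOWED = ("OAI-SearchBot", "PerplexityBot", "ChatGPT-User")


def bot_name(user_agent: str):
    """Return the first known bot token found in the user-agent, or None."""
    ua = user_agent.lower()
    for name in DENIED + ALLOWED:
        if name.lower() in ua:
            return name
    return None


def summarize(lines) -> Counter:
    """Count page requests per known bot, ignoring robots.txt fetches."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match["path"] == "/robots.txt":
            continue
        name = bot_name(match["ua"])
        if name:
            counts[name] += 1
    return counts


def needs_hard_block(counts: Counter) -> list[str]:
    """Denied bots that fetched pages anyway: candidates for an edge block."""
    return sorted(name for name in counts if name in DENIED)
```

Run `summarize` over a day of logs, and `needs_hard_block` tells you which deny-listed crawlers ignored the polite layer. Counts for the allowed bots show whether retrieval crawlers are actually visiting.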

Watch out: the OpenAI bot reportedly escapes Nepenthes

Aaron has confirmed that Nepenthes traps "all the major crawlers" with one exception, OpenAI's bot, which has reportedly learned to back out. OpenAI publicly stated, "We're aware of efforts to disrupt AI web crawlers" and that its systems are "designed to be resilient." Treat tarpit effectiveness as a moving target. The arms race is real and it does not pause for your deploy.

Should you deploy a tarpit or not?

The thing nobody should do is fire any of these tools at the whole internet, walk away, and assume the SEO and AI-citation pipelines will sort themselves out. They will not. Per-bot policy is now table stakes — the baseline guardrail between your content and the models that want to train on it without attribution.

Figure: Defending a website in 2026 means layering polite robots.txt rules with edge controls that only engage when a crawler ignores the policy.

For most operators in 2026, the answer is "Cloudflare AI Labyrinth is a free toggle, turn it on, and pair it with a thoughtful robots.txt." If your hosting is already on Cloudflare, that is the lowest-friction path that gives you defense without inheriting the maintenance bill. If you self-host serious infrastructure that is buckling under crawler load (a Git forge, a wiki, a documentation site), Anubis is what the GNOME and FFmpeg teams reached for, and there is a reason. If your goal is more "send a message to the AI labs" than "keep my server up," Nepenthes and Iocaine are the protest-tech of choice, and they will satisfy that itch.

Key Takeaways

Tarpits are not a single thing. Nepenthes, Iocaine, and Quixotic poison. Anubis blocks. Cloudflare AI Labyrinth fingerprints. Pick the philosophy that matches your actual problem.
The famous 94 percent number is Iocaine, one site, one operator. Cite it carefully, never as an industry benchmark.
Cloudflare AI Labyrinth is real LLM content, not Markov gibberish. Most secondary coverage gets this wrong.
Indiscriminate blocking measurably costs traffic. Rutgers and Wharton put the cost at 23.1 percent monthly (later revised to roughly 7 percent weekly) on major newspapers.
Tarpit the training scrapers, allow the retrieval bots. The whole game in 2026 is per-user-agent policy, not all-or-nothing.
Front-load your answers. Retrieval bots evaluate citation worth from the first 200 words.

FAQ

Will deploying a tarpit hurt my SEO?

It can, in two ways. First, if your detection rules accidentally catch Googlebot or Bingbot, you risk de-indexing or "indexed without content" warnings in Search Console. Second, if search engines crawl the synthetic pages your tarpit generates and classify them as thin or deceptive, the rest of your domain can take a quality-signal hit. The fix is to whitelist legitimate search crawlers explicitly by user-agent and IP range, and to confine the tarpit response to bots you genuinely intend to trap.
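Because user-agent strings are easy to fake, an allowlist is only as good as its verification. Google and Bing both document a reverse-then-forward DNS check for their crawlers, and the sketch below follows that pattern. The hostname suffixes reflect those published procedures as I understand them and should be confirmed against current documentation. The function names and the injectable resolver parameters are assumptions made so the logic can be exercised without network access.

```python
import socket

# Hostname suffixes per crawler; verify against the vendors' current docs.
TRUSTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}


def dns_reverse(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def dns_forward(host: str) -> list[str]:
    """Forward lookup: hostname to IPv4 addresses (IPv4 only in this sketch)."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(name: str, ip: str,
                        reverse=dns_reverse, forward=dns_forward) -> bool:
    """True only if the IP reverse-resolves to a trusted hostname AND that
    hostname resolves back to the same IP."""
    suffixes = TRUSTED_SUFFIXES.get(name)
    if not suffixes:
        return False
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(suffixes):
            return False
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

In production you would cache results, since two DNS lookups per request would be its own performance problem.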

Do AI scrapers actually fall for tarpits in 2026?

The smaller and less-tuned ones still do. Major crawlers adapt. OpenAI's crawler has reportedly learned to escape Nepenthes-style mazes, and the cat-and-mouse loop is constant. Tarpits are most effective against bulk, unsophisticated scrapers whose operators have not bothered to add trap detection. Against the largest labs, expect diminishing returns over time.

Is poisoning AI training data legal?

In most jurisdictions, on content you control, yes. The ethical question is separate from the legal one. If you treat scraping that ignores robots.txt as unauthorized access, then poisoning the result of that access is consistent with self-defense. If you treat the open web as a commons, the ethics get murkier. The legal exposure for the operator deploying a tarpit on their own site is generally low; the exposure for the scraper ignoring robots.txt is, depending on jurisdiction, sometimes higher than the scraper assumes.

Can I block AI training while still ranking in Google AI Overviews?

Yes, and this is the entire point of per-user-agent policy. Google separates Googlebot (classic search indexing, which also feeds AI Overviews) from Google-Extended, a robots.txt control token that governs whether your content is used for Gemini training and grounding. Per Google's crawler documentation, Google-Extended does not affect inclusion or ranking in Google Search. Disallowing Google-Extended therefore opts you out of Gemini training while Googlebot stays fully welcome, and AI Overviews eligibility follows Googlebot and the usual snippet controls. The same logic applies to the third-party training crawlers (GPTBot, ClaudeBot, Meta-ExternalAgent): disallow them, since they do not feed any retrieval surface you care about.

What's the difference between Anubis and a tarpit?

A tarpit serves the bot junk to waste its time and (sometimes) corrupt its training set. Anubis never serves the bot anything. It interposes a JavaScript proof-of-work challenge that legitimate browsers solve invisibly and headless scrapers fail outright. Tarpits are offensive; Anubis is defensive. Drew DeVault of SourceHut, after running both, summarized the practical verdict: tarpits feel like justice, Anubis is what works at scale.

Is Cloudflare AI Labyrinth available on the Free plan?

Yes. It is opt-in for every Cloudflare customer including the Free tier, enabled with a single toggle in the bot-management section of the dashboard. Cloudflare announced it on March 19, 2025, and it remains free as of mid-2026.

Sources: Safe Web Research Install Guide on PCDrama, Cloudflare Blog: Trapping misbehaving bots in an AI Labyrinth, Cloudflare Docs: AI Labyrinth, Xe Iaso: Block AI scrapers with Anubis, Wikipedia: Anubis (software), TechCrunch: Open-source devs are fighting AI crawlers with cleverness and vengeance, The Register: Anubis fighting off the hordes of LLM bot crawlers, Hackaday: Trap Naughty Web Crawlers In Digestive Juices With Nepenthes, Hacker News: AI haters build tarpits discussion.