
The Dark Matter of the Web

April 2026 · 12 min read · Charlotte Camilleri

Ask most people what AI knows, and they will tell you: everything. It has read the whole internet. It knows more than any human could.

This is not quite right. AI has read the publicly indexable web. That is a different thing. And the gap between those two statements is where a significant amount of this conversation should be happening, but mostly is not.

The public web is not the internet. It is the part of the internet that search engines can crawl, that requires no login, that nobody thought to protect. A large and growing proportion of the most valuable human knowledge does not live there. It lives on Discord servers with access links that expire. In paid Substack newsletters. In private Slack workspaces. In WhatsApp groups that practitioners have been using for years. In paywalled academic journals. In internal wikis that will never see a spider.

AI systems trained on the public web are trained on what the public web contains. Which is increasingly: SEO content, marketing copy, forum posts, news articles written on deadline, and content generated by earlier AI systems trained on the same thing.

The practitioners with real domain knowledge are somewhere else. They have been migrating there for years. And nobody building AI is particularly keen to acknowledge the implications.

What AI Was Actually Trained On

The dominant source of training data for large language models is not a curated library of expert knowledge. It is a nonprofit organisation called Common Crawl.

Common Crawl is a web crawler that has been scraping publicly accessible pages since 2008. Its archive stood at approximately 9.5 petabytes as of mid-2023, and it is the single most important pre-training data source in the AI industry. A 2024 analysis published at the ACM FAccT conference found that Common Crawl had effectively become "a foundational building block for LLM development," used so frequently and in such large proportions that its characteristics shape the characteristics of the models built from it.

GPT-3's training data broke down as follows: 60% Common Crawl, 22% WebText2 (web pages from outbound Reddit links with three or more upvotes), 16% from two book corpora, and 3% English-language Wikipedia. The Mozilla Foundation's analysis was direct about it: over 80% of GPT-3's tokens stemmed from Common Crawl.

Common Crawl is a snapshot of what happens to be publicly accessible. It is not quality-filtered for domain expertise. It is not sourced from practitioner communities. DeepSeek removed nearly 90% of repeated content across 91 Common Crawl dumps just to extract something usable from it. The raw data is that compromised.

OpenAI has confirmed explicitly that its models are trained only on information that is "freely and openly accessible on the internet" and that it excludes content behind paywalls. This is presented as a privacy-respecting policy. It is also a description of a structural knowledge gap.

Where the Expertise Actually Went

This matters because a significant knowledge migration has been happening in parallel with the rise of AI, and it is going in the opposite direction.

Discord. The platform had 260 million monthly active users by 2025, across 32.6 million servers. Over 1.1 billion messages are exchanged daily. The average server size, per Discord's own VP of Sales, is 5 to 20 users. These are not broadcast channels. They are small, high-trust spaces where practitioners actually discuss what they are doing, without performance anxiety and without an audience of strangers.

The Midjourney Discord has over 20 million members discussing AI image generation in real time. There are servers for every specialist SEO discipline, every programming language, every niche industry. The conversations happening in these spaces are more technically current, more honestly opinionated, and more practically useful than almost anything indexed by Common Crawl. None of it is accessible to AI crawlers.

Substack. By March 2025, Substack had over 5 million paid subscriptions, up from 2 million in 2023. More than 50,000 publications are earning money on the platform. The paywalled tier, by design, is excluded from AI training. The writers who have built audiences substantial enough to charge for access are, generally, the ones with something worth paying for. That content is dark.

Private Slack. Enterprise and professional community workspaces contain years of practitioner knowledge: debugging threads, strategic debates, post-mortems, informal expertise that has never been written up anywhere. Not indexed. Not crawlable. Not in any training set.

Academic research. The most rigorous domain knowledge on most subjects lives behind journal paywalls. A 2025 paper examining OpenAI's models found evidence that GPT-4o likely trained on paywalled O'Reilly Media books, suggesting some access violations may have occurred despite stated policy. But the broader picture for peer-reviewed journals is that the deep academic literature is not in the training data. AI knows what is openly available. It does not know what Elsevier charges $35 per article to read.

This migration is not accidental. Substack's head of lifestyle partnerships has noted that creators are increasingly moving to "paywalled personal" spaces precisely because public platforms have been degraded by scale, algorithmic incentives, and now AI scraping. The people who know things have worked out that public web content produces diminishing returns for them.

  • 80%+ of GPT-3's training tokens came from Common Crawl, a general web scrape with light filtering, per Mozilla Foundation analysis
  • 1.1bn messages exchanged daily on Discord across 32.6 million servers, none of it accessible to AI training crawlers
  • 5M+ paid Substack subscriptions as of March 2025, paywalled by design and excluded from AI training data
  • 79% of top news sites now block AI training crawlers via robots.txt, per BuzzStream 2025 study

The Walls Going Up

Publishers have not sat passively while AI systems scraped their content without compensation. The response has been to block.

By August 2024, 35.7% of the world's top 1,000 websites were blocking OpenAI's GPTBot, a seven-fold increase from the 5% blocking rate when the crawler launched in August 2023. GPTBot saw a 305% increase in request volume between May 2024 and May 2025, meaning the scraping intensified at exactly the moment resistance to it intensified. A BuzzStream study found that 79% of top news sites now block AI training bots via robots.txt.
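The blocking mechanism itself is worth seeing, because it is so slight. A robots.txt file is a plain-text request, and GPTBot is simply OpenAI's published crawler user-agent token. A minimal sketch using Python's standard-library parser, against a hypothetical policy of the kind the BuzzStream study counts:

```python
from urllib import robotparser

# A hypothetical robots.txt: block OpenAI's training crawler,
# allow every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot is told to stay out; an ordinary browser agent is not.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

Note what this is: an advisory convention (now codified as RFC 9309), not an enforcement mechanism. A crawler honours robots.txt only if it chooses to, which is part of why blocking costs publishers traffic without guaranteeing exclusion from training pipelines.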

The blocking has costs. Research from Rutgers Business School and Wharton published in December 2025 found that publishers blocking AI crawlers experienced a total traffic decline of 23.1% and a 13.9% decline in human-only browsing. The trade-off is increasingly difficult to justify, because blocking does not reliably prevent AI citation anyway. You lose the traffic without reliably removing yourself from the training pipeline.

What this produces is an arms race between AI companies expanding their crawl and publishers trying to protect content economics they can no longer rely on. The outcome of that race is that the public web is becoming more adversarial and less representative of actual expertise.

Major publishers with the resources to negotiate licensing deals are doing that instead. The New York Times is suing OpenAI. Reddit licensed its data to Google. These arrangements mean AI companies with the money to pay for quality data get it, and the open-access assumption behind Common Crawl becomes less accurate over time. The gap between well-funded and under-resourced AI training pipelines will widen accordingly.

The Feedback Loop Nobody Is Measuring

Here is where it gets structurally worse.

As the quality of the indexable public web declines and experts migrate to private spaces, the content that remains on the public web is increasingly produced to fill the gap. Much of it is now generated by AI systems trained on earlier versions of the public web.

Research published in Nature in 2024 by Shumailov et al. established that "indiscriminately training generative AI on real and generated content, usually done by scraping data from the internet, can lead to a collapse in the ability of the models to generate diverse high-quality output." Model collapse is what happens when training data is progressively contaminated by prior model outputs. Errors compound. The tails of the distribution disappear. Rare but important knowledge patterns vanish first.

A 2024 ICML paper from NYU researchers found that as more synthetic data is incorporated into training, the traditional scaling laws that have driven AI progress begin to break down. More compute produces diminishing returns when the data is contaminated.

The web is now being written partly by systems trained on the web. That content gets indexed. The next generation of systems trains on it. What the model thought it knew gets recirculated as apparent evidence of what the model correctly knows. The cycle is already running, and there is no mechanism in place to filter synthetic content out of Common Crawl at scale.
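The "tails vanish first" dynamic has an absorbing-state quality that a toy simulation makes concrete. This is a sketch, not Shumailov et al.'s actual setup, and the token names and probabilities are invented for illustration: each "generation" is a maximum-likelihood fit to the previous generation's output, so anything the model never saw gets probability zero, and zero is permanent.

```python
import random
from collections import Counter

random.seed(7)

TOKENS = ["common", "typical", "rare"]   # "rare" stands in for niche practitioner knowledge
TRUE_PROBS = [0.60, 0.37, 0.03]          # the real (human) distribution
N = 100                                  # documents sampled per generation

# Generation 0: sampled from the true human distribution.
corpus = random.choices(TOKENS, weights=TRUE_PROBS, k=N)

rare_share = []
for generation in range(40):
    counts = Counter(corpus)
    rare_share.append(counts["rare"] / N)
    # The next "model" is a maximum-likelihood fit to the previous
    # generation's output. A token it never saw gets weight zero,
    # and zero weight means it can never reappear.
    weights = [counts[t] for t in TOKENS]
    corpus = random.choices(TOKENS, weights=weights, k=N)

print("rare-token share by generation:", [round(s, 2) for s in rare_share])
```

The common tokens survive indefinitely; the rare one drifts toward zero, and the first generation that samples none of it erases it for good. That asymmetry, rather than uniform degradation, is what makes the contamination hard to notice: the model still sounds fine on the head of the distribution.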

The compounding problem

The content that will train the next generation of AI systems includes an unknown proportion of output from the current generation. There is no reliable way to filter it. The degree of contamination is increasing. And the practitioners who could correct it are publishing somewhere AI cannot see.

What the Public Web Actually Contains

Step back from what AI companies say they trained on, and ask what the public web actually looks like.

It skews toward: content produced to rank in search engines, content produced by people with time to produce public-facing content (not the same thing as expertise), content from English-language sources in developed countries, content that has not been paywalled because it was not good enough to charge for, and content produced by media organisations that have been systematically defunded over the past decade.

Stanford's CS324 course on large language models noted that internet data overrepresents younger users from developed countries, and that quality filtering creates its own distortions by marginalising minority voices and non-standard language patterns.

WebText2, which formed 22% of GPT-3's training mix, was built from outbound links on Reddit posts with three or more upvotes. Reddit is a useful proxy for engaged public internet discussion. It is not a proxy for practitioner expertise. The upvote mechanism rewards the confidently stated over the carefully qualified. That selection bias is baked into the base models of the most widely deployed AI systems in the world.

The public web that AI was trained on is not a neutral archive of human knowledge. It is a reflection of what humans do publicly, for free, in English, on platforms that prioritise engagement. That is a biased and shrinking sample of what humans actually know.

What This Means for Search and SEO

AI systems are being deployed as information retrieval tools under the assumption that their training data represents the state of human knowledge. For well-documented public topics, this is a reasonable approximation. For specialist domains where the real expertise has migrated to private platforms, the AI answer is constructed from whatever remains on the public web. Which may be thin, outdated, or dominated by content written to rank rather than to inform.

For anyone building content strategy around AI citation, the question is not just "is my content indexed?" It is "what is my content competing with when the model constructs an answer about my domain?" If the most credible practitioners in a niche have moved their serious discussions to private Discord servers, the AI's view of that niche is shaped by whoever was still publishing publicly. That may be affiliate sites and brand blogs rather than working experts.

For iGaming specifically, this is already visible. The credible practitioner discussion about SEO in regulated markets, about affiliate strategy, about compliance nuances, is not predominantly happening on public blogs. It is in private Slack communities, paid newsletters, closed industry groups. The AI's understanding of iGaming SEO is largely built from the public-facing layer: operator landing pages, affiliate review sites, conference writeups. The actual strategic conversations are elsewhere.

The implication for E-E-A-T is uncomfortable. Google's own quality frameworks ask whether the author has genuine first-hand experience. The public web, increasingly, contains content that has been optimised to appear as if it does. The people who genuinely do are publishing somewhere that AI cannot see and Google cannot easily verify.

The Argument That Is Not Being Had

The AI industry's standard response to the training data quality question is: we filter aggressively, we prioritise high-quality sources, we use RLHF to align outputs with human preferences.

These are genuine improvements on raw Common Crawl. They do not address the structural problem.

Filtering for quality within the public web does not recover knowledge that has moved off the public web. RLHF trains on human feedback about outputs, which can improve fluency and helpfulness without improving the accuracy of domain knowledge the model simply does not have. Making a wrong answer sound more confident is not a solution to a training data gap.

The companies that are acknowledging this are doing it through commercial deals: licensing Reddit, negotiating with publishers, acquiring proprietary datasets. The implicit admission in those deals is that the publicly crawlable web is not sufficient. The companies that cannot afford those deals train on whatever is available.

Estimates suggest that high-quality human-written web text may be effectively exhausted as a training resource. The race to secure proprietary data is a race to avoid that ceiling. The winner of that race will not be sharing the methodology.

What Actually Knowing Things Will Be Worth

There is an argument that this plays out fine. AI is good enough for most queries. The gaps are in specialist domains where most users do not need specialist-level accuracy. The system works adequately at scale.

This might be true if AI were being deployed only for general-purpose queries. It is not. It is being deployed as a replacement for expert consultation: as a medical reference, a legal assistant, a financial adviser, a technical documentation system. In each of those contexts, the gap between the public web and practitioner knowledge is the gap between an adequate and a dangerous answer.

The dark matter of the web is not a romantic metaphor about lost wisdom. It is a description of a structural problem in how the most sophisticated information retrieval systems in history were built. They were built on what was available. What was available was the public web. The public web is not where the most valuable knowledge lives.

That knowledge is on a Discord server that requires an invite. It is behind a paywall. It is in a Slack workspace that has never been indexed. It is in the head of someone who stopped publishing publicly because the engagement-optimised public web stopped being worth engaging with.

AI can tell you what the internet says. It cannot tell you what practitioners know. That distinction is not getting smaller. It is getting larger, faster, as the people who know things make increasingly rational decisions about where to put their knowledge.

Key Takeaways

  • The dominant source of AI training data is Common Crawl, a general web scrape. Over 80% of GPT-3's training tokens came from it. This is not a curated expert knowledge base, and it was never designed to be one.
  • A sustained migration of expert knowledge off the public web has been running in parallel with AI's rise. Discord, paid Substack newsletters, private Slack workspaces, and paywalled research all contain practitioner expertise that AI training pipelines cannot access.
  • 79% of top news publishers are now blocking AI training crawlers. The knowledge gap is being actively widened by the publishers whose content was once indexed freely.
  • Model collapse research (Shumailov et al., Nature 2024) confirms that training on AI-generated outputs degrades model quality over successive generations. The public web is increasingly contaminated with synthetic content, and there is no reliable mechanism to filter it at scale.
  • For search practitioners: AI answers in specialist domains are constructed from whatever remains publicly indexed in those domains, which may not be the credible practitioner layer. Content strategy that ignores this is optimising for a model that does not accurately represent the knowledge landscape.
  • For iGaming: the serious practitioner conversation is predominantly in private channels. The AI's view of the industry is shaped by the public-facing layer, which is not the same thing.
