Web Scraping

Automated web crawling and scraping, including AI-driven data collection bots and their impact on websites and infrastructure.

Reading List

Under the Hood

The AI Blind Spot: Why Your Robots.txt is Stuck in 2023

Jul 7, 2026

Websites are diligently fighting a 2023 war against AI training crawlers while remaining strategically blind to the real-time answer bots that define the current AI landscape.

AI Training Data, Web Scraping, AI Search, AI Business Models, Open Web
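As a rough illustration of the blind spot described in this entry, the sketch below parses a robots.txt that blocks two well-known training crawlers by name and then checks a third agent. The robots.txt contents, the URL, and the agent name `ExampleAnswerBot` are hypothetical and chosen only for the example; they are not taken from the linked article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: training crawlers are blocked by name,
# everything else falls through to the permissive default group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/article"):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# "ExampleAnswerBot" is a made-up name standing in for a real-time answer bot.
result = allowed_agents(ROBOTS_TXT, ["GPTBot", "CCBot", "ExampleAnswerBot"])
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Any agent that is not listed by name matches only the `*` group, so a block list written for one generation of crawlers says nothing about newer ones.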

Products & Announcements

Anna’s Archive: Official Data Access and Donation Guide for LLMs

May 22, 2026 · 878

Anna’s Archive provides official bulk access methods for LLMs and requests donations from AI entities to support the preservation of the knowledge they use for training.

AI Training Data, Shadow Libraries, Digital Preservation, Intellectual Property, Web Scraping

Damage Control

The Rise of Data Poisoning: Sabotaging the AI Slop Machines

Apr 20, 2026 · 387

Internet users are increasingly using 'data poisoning' and misinformation to sabotage AI training sets in protest of unethical web scraping.

AI Training Data, Web Scraping, Tech Worker Activism, Data Poisoning, AI Ethics

Damage Control

The Bot Crisis: Why Internet Traffic is 70% Automated

Mar 29, 2026 · 236

Bot traffic is likely much higher than reported, but it can be effectively neutralized with JavaScript-based proof-of-work defenses.

Web Scraping, AI Training Data, Cybersecurity, Bot Detection & Mitigation
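For readers unfamiliar with proof-of-work defenses, here is a minimal hashcash-style sketch of the general idea, assuming a SHA-256 puzzle with a leading-zero-bits target. It is written in Python for brevity; a real deployment of the kind this entry mentions would run the solver as JavaScript in the visitor's browser. The function names, the challenge format, and the difficulty value are all assumptions for illustration, not the linked article's implementation.

```python
import hashlib
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force nonces until the hash meets the target."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce


# Hypothetical parameters: a random per-request challenge and a low
# difficulty (about 2**12 hashes on average) so the demo finishes quickly.
DIFFICULTY = 12
challenge = secrets.token_hex(16)
nonce = solve(challenge, DIFFICULTY)
print(nonce, verify(challenge, nonce, DIFFICULTY))
```

The asymmetry is the point: solving costs thousands of hashes per request, while checking costs one, so the burden falls mainly on clients that make very many requests.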

Agentic Systems

GitHub - adam-s/intercept: Turn any website into a typed JSON API using self-improving agents

Mar 27, 2026

A framework for Claude Code that uses self-improving AI agents to transform websites into structured APIs and functional web applications.

Self-Modifying AI, Web Scraping, AI Agents, Browser Automation, AI Coding Agents

Agentic Systems

Lightfeed Extractor: LLM-Powered Web Scraping Library

Mar 26, 2026

A TypeScript library for robust, LLM-powered web data extraction and browser automation.

Web Scraping, TypeScript, Browser Automation, Structured Output, Natural Language Processing

Products & Announcements

Cloudflare Simplifies AI Data Ingestion with New Site-Wide Crawling API

Mar 11, 2026 · 487

Cloudflare's new API endpoint simplifies website-wide data extraction by automating discovery and rendering into AI-friendly formats.

Web Scraping, Retrieval-Augmented Generation, Browser Automation, AI Training Data, Cloud Infrastructure

Damage Control

Shutting Down My Self-Hosted Git After AI Scraper Overload

Feb 11, 2026 · 298

AI scrapers killed my self-hosted Git, so I’ve moved everything to GitLab/GitHub and hardened my static blog’s logging.

Web Scraping, Self-Hosting, Open Source, AI Training Data

Damage Control

Bots Overwhelmed Bear’s Reverse Proxy—What Broke and How It’s Now Hardened

Oct 29, 2025 · 209

Aggressive scrapers overwhelmed Bear’s reverse proxy, prompting a hardening of monitoring, capacity, and bot controls in an ongoing battle with hostile bot traffic.

Web Scraping, Cybersecurity, Self-Hosting, DevOps

Damage Control

Ravenous AI Crawlers Are Breaking the Web—and Driving It Behind Paywalls

Sep 2, 2025 · 213

AI crawlers’ ravenous, non-reciprocal scraping is breaking websites and pushing the open web toward paywalled fragmentation.

Web Scraping, Open Web, AI Training Data, Corporate Accountability, Technology Economics