Shutting Down My Self-Hosted Git After AI Scraper Overload

Added Feb 11
Article: Negative | Community: Positive/Mixed

After AI scrapers overwhelmed their cgit frontend, the author shut down their long-running self-hosted git service. The GitLab and GitHub mirrors are now the primary homes for the repositories, and all links have been updated accordingly. The only remaining self-hosted service, a static blog, suffered one scraper-induced outage from unbounded log growth, which was fixed by reconfiguring logrotate.

Key Points

  • Self-hosted public git service has been permanently shut down due to relentless AI scraper traffic overwhelming the cgit frontend.
  • All repositories now live primarily on GitLab and GitHub, and links have been updated from the old cgit endpoints.
  • The only remaining self-hosted service is a static Jekyll-based blog, chosen for resilience against scraper load.
  • AI scrapers still triggered an outage by flooding 404 responses for the defunct cgit paths, filling disks via Apache logs.
  • Log rotation was reconfigured to prevent log growth from causing future outages.
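
The logrotate fix mentioned above might look something like the following. This is a minimal sketch, not the author's actual configuration; the log path, size cap, and retention count are assumptions:

```
# /etc/logrotate.d/apache2 -- hypothetical example; paths and limits are assumptions
/var/log/apache2/*.log {
    daily
    maxsize 100M        # rotate early if scraper traffic inflates the log mid-cycle
    rotate 7            # keep one week of history
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        systemctl reload apache2 > /dev/null 2>&1 || true
    endscript
}
```

The key addition for this failure mode is `maxsize`, which triggers rotation as soon as a log exceeds the threshold rather than waiting for the daily schedule, so a sudden 404 flood cannot fill the disk between scheduled rotations.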

Sentiment

The community is broadly sympathetic to the author and frustrated with AI companies' scraping practices. Most commenters agree the problem is real, widespread, and worsening, with many sharing their own war stories from self-hosted instances. However, the discussion is more solution-focused than purely angry, with practical mitigation strategies dominating the conversation. A minority of voices push back, either questioning whether the author tried hard enough or arguing that bot traffic management has always been part of running a public server.

In Agreement

  • AI scrapers are verified to come from major companies like OpenAI using their published IP ranges, and they behave far worse than traditional search crawlers by ignoring robots.txt and repeatedly scraping unchanged content
  • Git forges are uniquely vulnerable because each commit, diff, and blame view has its own distinct URL, yielding an enormous crawl surface that caching cannot meaningfully mitigate
  • The scraper ecosystem involves compromised residential devices and proxy networks, making IP-based blocking largely ineffective against the most aggressive actors
  • Suggesting Cloudflare as a solution is ironic because it centralizes the web and undermines the self-hosting independence that was the original goal
  • Legislation is currently too lax, allowing aggressive scraping under the cover of AI training with little accountability
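
The IP-range verification commenters describe can be sketched in a few lines with Python's standard `ipaddress` module. The CIDR ranges below are illustrative placeholders, not a verified list; real crawler ranges are published by the operators and change over time:

```python
import ipaddress

# Hypothetical sample of published crawler CIDR ranges (placeholders only;
# actual lists are published by the crawler operators and change over time).
CRAWLER_RANGES = [
    ipaddress.ip_network("20.171.0.0/16"),   # assumption, not a verified range
    ipaddress.ip_network("52.230.0.0/15"),   # assumption, not a verified range
]

def is_known_crawler(ip: str) -> bool:
    """Return True if the address falls inside any listed crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CRAWLER_RANGES)

print(is_known_crawler("20.171.12.34"))  # True: inside 20.171.0.0/16
print(is_known_crawler("192.0.2.1"))     # False: not in any listed range
```

Matching access-log IPs against such ranges confirms who the well-behaved scrapers are, but, as the thread notes, the most aggressive traffic arrives via residential proxies and never appears in any published range.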

Opposed

  • The author could have used readily available mitigations like a properly configured robots.txt with specific bot User-Agents, which some report effectively eliminated scraper traffic
  • Some commenters see the AI scraper narrative as overblown, suggesting people enjoy claiming AI is oppressing them rather than implementing straightforward technical solutions
  • Scraping and bot traffic are nothing fundamentally new; if you run a public server, you should expect and prepare for malicious traffic
  • A conspiracy theory emerged suggesting cloud companies like AWS or Cloudflare might be driving scraper traffic to force users onto paid infrastructure
  • Requiring JavaScript for bot protection is an acceptable trade-off since fewer than 0.1% of users disable it, and modern accessibility tools work fine with JS
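
The robots.txt mitigation raised in the first point above might look like this minimal sketch. The bot names are publicly documented crawler User-Agents; whether a given scraper actually honors the file is exactly what the rest of the thread disputes:

```
# robots.txt -- minimal sketch targeting specific AI crawler User-Agents.
# Whether any given crawler honors these rules is disputed in the thread.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else (including traditional search crawlers) remains allowed.
User-agent: *
Disallow:
```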