Ole Reissmann


AI & Journalism Links

A visit to the nonprofit that powers most of today’s AI training: Common Crawl ingests the web, journalism and all, unapologetically. The article falls a bit short on fair use and other archives, but it’s a good read. (Alex Reisner, The Atlantic)

Summary

  • Common Crawl, a nonprofit, has been quietly scraping paywalled articles that are then used to train AI models, despite publishers' requests to remove content.
  • The organization's executive director defends this practice, arguing that "you shouldn't have put your content on the internet if you didn't want it to be on the internet."
  • Common Crawl's archives appear to still contain millions of articles from major news outlets, despite claims of compliance with removal requests.

posted 6.11.2025 by oler · AI & Journalism

