A visit to the nonprofit that powers most of today’s AI training: Common Crawl ingests the web, journalism and all, unapologetically. The article falls a bit short on fair use and other archives, but it’s a good read. (Alex Reisner, The Atlantic)
Summary
- Common Crawl, a nonprofit, has been quietly scraping paywalled articles that AI companies then use to train their models, despite publishers' requests to remove content.
- The organization's executive director defends this practice, arguing that "you shouldn't have put your content on the internet if you didn't want it to be on the internet."
- Common Crawl's archives appear to still contain millions of articles from major news outlets, despite the organization's claims that it complies with removal requests.