Common Crawl Is Doing the AI Industry’s Dirty Work - The Atlantic

Created 11/4/2025 at 7:57:35 PM

https://www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567/?gift=iWa_iB9lkw4UuiWbIbrWGQv84IP0_-K67yuVC013Fx4

Sounds like news websites need to hire a proper engineer. This isn’t Common Crawl’s problem to solve:

Common Crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles.
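To make the mechanism in the quote concrete, here is a rough sketch of the difference between a paywall enforced in the browser and one enforced on the server. Everything in it is hypothetical and made up for illustration: the `ARTICLE` data, the `render_*` functions, the `hidePremiumUnlessSubscribed` script name, and the tiny `scrape` helper that stands in for any crawler that reads HTML without running JavaScript. It is not how Common Crawl or any particular publisher actually works.

```python
from html.parser import HTMLParser

# Hypothetical article, split into a free teaser and a paid body.
ARTICLE = {
    "teaser": "First paragraph, free to read.",
    "body": "Remaining paragraphs, meant for subscribers only.",
}


def render_client_side(article):
    """Ship the whole article and rely on browser JavaScript to hide it."""
    return (
        f"<p>{article['teaser']}</p>"
        f"<p class='premium'>{article['body']}</p>"
        # Assumed script name; a real site would load its own paywall code.
        "<script>hidePremiumUnlessSubscribed()</script>"
    )


def render_server_side(article, is_subscriber):
    """Decide on the server, so non-subscribers never receive the body."""
    parts = [f"<p>{article['teaser']}</p>"]
    if is_subscriber:
        parts.append(f"<p class='premium'>{article['body']}</p>")
    return "".join(parts)


class ParagraphText(HTMLParser):
    """Collect text inside <p> tags; scripts are parsed but never executed."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def scrape(html):
    """Stand-in for a crawler that reads raw HTML without running JavaScript."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(scrape(render_client_side(ARTICLE)))
    print(scrape(render_server_side(ARTICLE, is_subscriber=False)))
```

In the first case the paid text is already in the response, so anything that skips the script sees it. In the second case the paid text is never sent to an anonymous request, so there is nothing for a scraper to pick up.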

ai scraping

Public