#scraping

Public notes from activescott tagged with #scraping

Sunday, January 4, 2026

Tuesday, November 4, 2025

Sounds like news websites need to hire a proper engineer. This isn’t common crawls problem to solve:

Common crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles.