#scraping

Public notes from activescott tagged with #scraping

Friday, January 30, 2026

A comprehensive list of 500+ verified bots and web crawlers from CloudFlare Radar, available as a JSON dataset for bot detection, user agent analysis, and web scraping identification.

Why

Identifying legitimate bots from malicious scrapers is essential for web security and analytics. This package provides the official CloudFlare Radar verified bots directory, helping you:

Detect verified bots – Identify legitimate crawlers like Googlebot, Bingbot, and more
Filter analytics – Exclude known bots from your traffic reports
Allow-list crawlers – Permit verified bots while blocking suspicious traffic
User agent lookup – Match user agent strings against known bot patterns

Wednesday, January 28, 2026

An interesting tool that uses playwright to extract structure based on apparently accessibility roles and geometry of “important” elements and use that for an execution agent to process the page results. Important elements are somehow ranked. Then geometry is inferred from those elements.

Also relies on jest-style assertions to explicitly assert whether a step succeeded or failed.

Sunday, January 4, 2026

Tuesday, November 4, 2025

Sounds like news websites need to hire a proper engineer. This isn’t common crawls problem to solve:

Common crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles.