#scraping
Public notes from activescott tagged with #scraping
Wednesday, April 8, 2026
Sunday, April 5, 2026
The Attack: How it works | BrowserGate
Microsoft is running one of the largest corporate espionage operations in modern history.
Every time any of LinkedIn’s one billion users visits linkedin.com, hidden code searches their computer for installed software, collects the results, and transmits them to LinkedIn’s servers and to third-party companies including an American-Israeli cybersecurity firm.
The user is never asked. Never told. LinkedIn’s privacy policy does not mention it.
Because LinkedIn knows each user’s real name, employer, and job title, it is not searching anonymous visitors. It is searching identified people at identified companies. Millions of companies. Every day. All over the world. This is illegal and potentially a criminal offense in every jurisdiction we have examined.
LinkedIn loads an invisible tracking element (zero pixels wide, hidden off-screen) from HUMAN Security (formerly PerimeterX), an American-Israeli cybersecurity firm, that sets cookies on your browser without your knowledge. A separate fingerprinting script runs from LinkedIn’s own servers. A third script from Google executes silently on every page load. All of it encrypted. None of it disclosed.
Every time you open LinkedIn in a Chrome-based browser, LinkedIn’s JavaScript executes a silent scan of your installed browser extensions. The scan probes for thousands of specific extensions by ID, collects the results, encrypts them, and transmits them to LinkedIn’s servers. The entire process happens in the background. There is no consent dialog, no notification, no mention of it in LinkedIn’s privacy policy.
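Extension scanning of this kind is possible because an installed extension can expose "web-accessible resources" at a fixed `chrome-extension://<id>/...` URL, so whether such a load succeeds or fails leaks whether the extension is present. The sketch below illustrates that general technique only; it is not LinkedIn's code, and the extension ID and resource path are hypothetical placeholders.

```typescript
type Probe = { id: string; path: string };
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

const PROBES: Probe[] = [
  // Hypothetical entry: a detectable extension exposes a web-accessible
  // resource that any page may attempt to load by its fixed URL.
  { id: "aaaabbbbccccddddeeeeffffgggghhhh", path: "icon.png" },
];

async function scanExtensions(
  probes: Probe[],
  fetchFn: FetchLike = fetch
): Promise<string[]> {
  const found: string[] = [];
  for (const { id, path } of probes) {
    try {
      // Installed and exposed: the load succeeds; otherwise it fails,
      // so success/failure reveals whether the extension is present.
      const res = await fetchFn(`chrome-extension://${id}/${path}`);
      if (res.ok) found.push(id);
    } catch {
      // Not installed, or the resource is not web-accessible.
    }
  }
  return found;
}
```

Repeating that probe for thousands of IDs, as the page describes, yields a list of installed extensions that can then be serialized and transmitted.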
This page documents exactly how the system works, with line references and code excerpts from LinkedIn’s production JavaScript bundle.
See https://browsergate.eu/how-it-works/
Friday, January 30, 2026
microlinkhq/cloudflare-bot-directory: CloudFlare Radar verified bots directory – 500+ web crawlers and user agents as JSON.
A comprehensive list of 500+ verified bots and web crawlers from CloudFlare Radar, available as a JSON dataset for bot detection, user agent analysis, and web scraping identification.
Why
Distinguishing legitimate bots from malicious scrapers is essential for web security and analytics. This package provides the official CloudFlare Radar verified bots directory, helping you:
- Detect verified bots – identify legitimate crawlers like Googlebot, Bingbot, and more
- Filter analytics – exclude known bots from your traffic reports
- Allow-list crawlers – permit verified bots while blocking suspicious traffic
- User agent lookup – match user agent strings against known bot patterns
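The lookup use cases above boil down to matching an incoming User-Agent header against the directory. A minimal sketch, assuming a simplified entry shape (`{ name, pattern }`) rather than the package's actual JSON schema:

```typescript
type BotEntry = { name: string; pattern: string };

// Illustrative stand-in for the 500+ entry directory.
const SAMPLE_DIRECTORY: BotEntry[] = [
  { name: "Googlebot", pattern: "googlebot" },
  { name: "Bingbot", pattern: "bingbot" },
];

function classifyUserAgent(
  ua: string,
  directory: BotEntry[] = SAMPLE_DIRECTORY
): { verified: boolean; bot?: string } {
  // Case-insensitive substring match against each known bot pattern.
  const hit = directory.find((b) => ua.toLowerCase().includes(b.pattern));
  return hit ? { verified: true, bot: hit.name } : { verified: false };
}
```

With this shape, `classifyUserAgent("Mozilla/5.0 (compatible; Googlebot/2.1)")` reports a verified Googlebot hit, while an unrecognized agent falls through as unverified, which is the basis for allow-listing or filtering analytics.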
Wednesday, January 28, 2026
Sentience API - Verification & Control Layer for Browser AI Agents | Semantic snapshots, assertions, traces + artifacts. Local-ready, cloud-friendly, vision optional
An interesting tool that uses Playwright to extract page structure, apparently from accessibility roles and the geometry of “important” elements, which an execution agent then uses to process the page. Important elements are ranked somehow, and geometry is inferred from them.
It also relies on Jest-style assertions to explicitly assert whether a step succeeded or failed.
Sunday, January 4, 2026
shot-scraper
A command-line utility for taking automated screenshots of websites
Tuesday, November 4, 2025
Common Crawl Is Doing the AI Industry’s Dirty Work - The Atlantic
Sounds like news websites need to hire a proper engineer. This isn’t Common Crawl’s problem to solve:
Common Crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles.
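The mechanism The Atlantic describes can be sketched in a few lines: a plain HTTP fetch returns the server-rendered HTML, and because the scraper never evaluates JavaScript, a client-side paywall script never runs. The URL in the sketch is illustrative only.

```typescript
type FetchLike = (url: string) => Promise<{ text(): Promise<string> }>;

async function fetchRawHtml(
  url: string,
  fetchFn: FetchLike = fetch
): Promise<string> {
  const res = await fetchFn(url);
  // Any <script> that would hide the article is still present in this
  // string as inert text; nothing here evaluates it, so content hidden
  // *by* that script stays visible.
  return res.text();
}
```

This is why a client-side paywall only works against browsers: the check runs on the reader's machine, and a scraper that skips script execution simply never performs it.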