#scraping

Public notes from activescott tagged with #scraping

Wednesday, April 8, 2026

Sunday, April 5, 2026

Microsoft is running one of the largest corporate espionage operations in modern history.

Every time any of LinkedIn’s one billion users visits linkedin.com, hidden code searches their computer for installed software, collects the results, and transmits them to LinkedIn’s servers and to third-party companies including an American-Israeli cybersecurity firm.

The user is never asked. Never told. LinkedIn’s privacy policy does not mention it.

Because LinkedIn knows each user’s real name, employer, and job title, it is not searching anonymous visitors. It is searching identified people at identified companies. Millions of companies. Every day. All over the world. This is illegal and potentially a criminal offense in every jurisdiction we have examined.

LinkedIn loads an invisible tracking element (zero pixels wide, hidden off-screen) from HUMAN Security (formerly PerimeterX), an American-Israeli cybersecurity firm, that sets cookies on your browser without your knowledge. A separate fingerprinting script runs from LinkedIn’s own servers. A third script from Google executes silently on every page load. All of it encrypted. None of it disclosed.

Every time you open LinkedIn in a Chrome-based browser, LinkedIn’s JavaScript executes a silent scan of your installed browser extensions. The scan probes for thousands of specific extensions by ID, collects the results, encrypts them, and transmits them to LinkedIn’s servers. The entire process happens in the background. There is no consent dialog, no notification, no mention of it in LinkedIn’s privacy policy.
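The general probing technique being described can be sketched as follows. In Chromium browsers, an extension's "web accessible resources" are reachable at `chrome-extension://<id>/<path>`; if such a resource loads, the extension is installed. The extension IDs, resource paths, and function names below are illustrative placeholders, not LinkedIn's actual list or code:

```javascript
// Illustrative probe list: each entry pairs a (fake) 32-character
// extension ID with a resource path that would be web-accessible.
const PROBES = [
  { id: 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', resource: 'icon.png' },
  { id: 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb', resource: 'inject.js' },
];

// loadResource is injected so the scan logic can be exercised outside a
// browser; in a real page it would try to load the URL (e.g. via an
// Image element or fetch) and resolve true on success.
async function scanExtensions(probes, loadResource) {
  const found = [];
  for (const { id, resource } of probes) {
    const url = `chrome-extension://${id}/${resource}`;
    try {
      if (await loadResource(url)) found.push(id);
    } catch {
      // Load failure: extension absent (or resource not web-accessible).
    }
  }
  return found; // list of installed extension IDs, ready to transmit
}
```

A page running this against thousands of known IDs learns which extensions (ad blockers, password managers, dev tools) a visitor has installed, with no dialog or permission prompt.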

This page documents exactly how the system works, with line references and code excerpts from LinkedIn’s production JavaScript bundle.

See https://browsergate.eu/how-it-works/

Friday, January 30, 2026

A comprehensive list of 500+ verified bots and web crawlers from CloudFlare Radar, available as a JSON dataset for bot detection, user agent analysis, and web scraping identification.

Why

Distinguishing legitimate bots from malicious scrapers is essential for web security and analytics. This package provides the official CloudFlare Radar verified bots directory, helping you:

Detect verified bots – Identify legitimate crawlers like Googlebot, Bingbot, and more
Filter analytics – Exclude known bots from your traffic reports
Allow-list crawlers – Permit verified bots while blocking suspicious traffic
User agent lookup – Match user agent strings against known bot patterns
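The user-agent lookup case can be sketched like this. The entry shape (`{ name, pattern }`) and the patterns themselves are assumptions for illustration; the actual JSON dataset's field names and contents may differ:

```javascript
// A tiny stand-in for the verified-bots dataset: each entry maps a bot
// name to a regex matched against the raw User-Agent header.
const VERIFIED_BOTS = [
  { name: 'Googlebot', pattern: /Googlebot\/\d+\.\d+/ },
  { name: 'Bingbot', pattern: /bingbot\/\d+\.\d+/ },
];

// Returns the verified bot's name, or null for unrecognized traffic.
function identifyBot(userAgent) {
  const match = VERIFIED_BOTS.find((bot) => bot.pattern.test(userAgent));
  return match ? match.name : null;
}
```

With a lookup like this, analytics code can exclude verified crawlers from traffic reports while leaving unrecognized user agents for stricter bot-detection checks.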

Wednesday, January 28, 2026

An interesting tool that uses Playwright to extract page structure, apparently from the accessibility roles and geometry of “important” elements, and feeds that structure to an execution agent that processes the page results. Important elements are ranked somehow, and geometry is then inferred from those elements.

It also relies on Jest-style assertions to explicitly assert whether each step succeeded or failed.
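One way such ranking could work is to score elements by accessibility role weighted by on-screen area. The role weights and scoring formula below are my guesses at the idea, not the tool's actual heuristics:

```javascript
// Hypothetical weights: interactive/landmark roles matter, generic ones don't.
const ROLE_WEIGHTS = { button: 3, textbox: 3, link: 2, heading: 2 };

// Score = role weight x visible area, so large interactive elements
// rank first; zero-weight (generic) elements are dropped entirely.
function rankElements(elements) {
  return elements
    .map((el) => ({
      ...el,
      score: (ROLE_WEIGHTS[el.role] ?? 0) * el.width * el.height,
    }))
    .filter((el) => el.score > 0)
    .sort((a, b) => b.score - a.score);
}
```

In the real tool, the role and geometry inputs would come from Playwright's accessibility tree and bounding boxes rather than hand-built objects.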

Sunday, January 4, 2026

Tuesday, November 4, 2025

Sounds like news websites need to hire a proper engineer. This isn’t Common Crawl’s problem to solve:

Common Crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles.
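The mechanism is easy to demonstrate with a contrived page (the markup and extraction regex below are illustrative, not Common Crawl's actual pipeline): the full article text ships in the HTML, and only a script that runs in a browser removes it. A scraper that keeps the raw bytes and never executes the script sees everything:

```javascript
// Contrived example of a client-side paywall: the article is in the HTML,
// and a script hides it for non-subscribers *after* the page loads.
const pageHtml = `
  <article id="story">Full article text, sent to every visitor.</article>
  <script>
    // Runs only in a real browser, never in a non-executing scraper.
    if (!document.cookie.includes('subscriber=1')) {
      document.getElementById('story').remove();
    }
  </script>`;

// A Common-Crawl-style scraper treats the <script> block as inert text
// and just extracts the article from the raw markup.
function extractArticle(html) {
  const match = html.match(/<article[^>]*>([\s\S]*?)<\/article>/);
  return match ? match[1] : null;
}
```

A server-side paywall, which simply never sends the article body to non-subscribers, would not be bypassed this way; that is the fix the publishers' engineers would need to ship.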