#scraping

Public notes from activescott tagged with #scraping

Wednesday, April 8, 2026

Playwright Stealth Mode in 2026: The 7 Patches That Actually Matter - DEV Community

dev.to/vhub_systems_ed5641f65d59/playwright-stealth-mode-in-2026-the-7-patches-that-actually-matter-46bp

#8:27 PM

scraping code playwright

Sunday, April 5, 2026

The Attack: How it works | BrowserGate

browsergate.eu/

Microsoft is running one of the largest corporate espionage operations in modern history.

Every time any of LinkedIn’s one billion users visits linkedin.com, hidden code searches their computer for installed software, collects the results, and transmits them to LinkedIn’s servers and to third-party companies including an American-Israeli cybersecurity firm.

The user is never asked. Never told. LinkedIn’s privacy policy does not mention it.

Because LinkedIn knows each user’s real name, employer, and job title, it is not searching anonymous visitors. It is searching identified people at identified companies. Millions of companies. Every day. All over the world. This is illegal and potentially a criminal offense in every jurisdiction we have examined.

LinkedIn loads an invisible tracking element, zero pixels wide and hidden off-screen, from HUMAN Security (formerly PerimeterX), an American-Israeli cybersecurity firm. The element sets cookies on your browser without your knowledge. A separate fingerprinting script runs from LinkedIn’s own servers. A third script from Google executes silently on every page load. All of it encrypted. None of it disclosed.

Every time you open LinkedIn in a Chrome-based browser, LinkedIn’s JavaScript executes a silent scan of your installed browser extensions. The scan probes for thousands of specific extensions by ID, collects the results, encrypts them, and transmits them to LinkedIn’s servers. The entire process happens in the background. There is no consent dialog, no notification, no mention of it in LinkedIn’s privacy policy.

This page documents exactly how the system works, with line references and code excerpts from LinkedIn’s production JavaScript bundle.

See https://browsergate.eu/how-it-works/

#3:17 PM

scraping microsoft privacy linkedin security

Friday, January 30, 2026

microlinkhq/cloudflare-bot-directory: CloudFlare Radar verified bots directory – 500+ web crawlers and user agents as JSON.

github.com/microlinkhq/cloudflare-bot-directory

A comprehensive list of 500+ verified bots and web crawlers from CloudFlare Radar, available as a JSON dataset for bot detection, user agent analysis, and web scraping identification.

Why

Distinguishing legitimate bots from malicious scrapers is essential for web security and analytics. This package provides the official CloudFlare Radar verified bots directory, helping you:
Detect verified bots – Identify legitimate crawlers like Googlebot, Bingbot, and more
Filter analytics – Exclude known bots from your traffic reports
Allow-list crawlers – Permit verified bots while blocking suspicious traffic
User agent lookup – Match user agent strings against known bot patterns
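
A minimal sketch of the user agent lookup idea, assuming each entry in the dataset carries a name and a pattern. The `SAMPLE_BOTS` list, its field names, and `find_verified_bot` are hypothetical stand-ins for illustration, not the package's actual schema or API:

```python
import re

# Hypothetical sample entries; the real dataset's fields may differ.
SAMPLE_BOTS = [
    {"name": "Googlebot", "pattern": r"Googlebot"},
    {"name": "Bingbot", "pattern": r"bingbot"},
]


def find_verified_bot(user_agent, bots=SAMPLE_BOTS):
    """Return the name of the first bot whose pattern matches, else None."""
    for bot in bots:
        if re.search(bot["pattern"], user_agent, re.IGNORECASE):
            return bot["name"]
    return None


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(find_verified_bot(ua))
```

A user agent string is easy to spoof, so a match like this only says what the client claims to be. It does not prove the request came from that crawler.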

#7:53 AM

cloudflare scraping code

Wednesday, January 28, 2026

Sentience API - Verification & Control Layer for Browser AI Agents | Semantic snapshots, assertions, traces + artifacts. Local-ready, cloud-friendly, vision optional

sentienceapi.com/

An interesting tool that uses Playwright to extract page structure, apparently from the accessibility roles and geometry of “important” elements, and hands that structure to an execution agent to process the page. The important elements are ranked somehow, and geometry is then inferred from them.

It also relies on Jest-style assertions to explicitly check whether each step succeeded or failed.
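
A rough sketch of how I imagine the idea could work, not how Sentience actually does it. The `Element` type, the role weights, the area-based score, and `assert_step` are all my own assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class Element:
    role: str      # accessibility role, e.g. "button"
    name: str      # accessible name
    x: float
    y: float
    width: float
    height: float


# Assumed weights: interactive roles count more than generic containers.
ROLE_WEIGHTS = {"button": 3.0, "textbox": 3.0, "link": 2.0, "heading": 1.5}


def importance(el):
    """Score an element by role weight times on-screen area (an assumption)."""
    return ROLE_WEIGHTS.get(el.role, 0.5) * el.width * el.height


def snapshot(elements, limit=3):
    """Keep only the highest-scoring elements for the agent to reason about."""
    return sorted(elements, key=importance, reverse=True)[:limit]


def assert_step(condition, message):
    """Fail loudly when a step's expected outcome did not happen."""
    if not condition:
        raise AssertionError(message)


if __name__ == "__main__":
    page = [
        Element("button", "Search", 10, 10, 100, 40),
        Element("link", "Home", 0, 0, 50, 20),
        Element("generic", "footer", 0, 900, 1000, 10),
    ]
    top = snapshot(page, limit=2)
    assert_step(any(el.name == "Search" for el in top), "Search button not found")
    print([el.name for el in top])
```

The point of the explicit assertion is that the agent never has to guess whether a step worked: either the expected element is present in the snapshot or the run stops with a clear message.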

#6:57 PM

scraping agents llm

Sunday, January 4, 2026

shot-scraper

shot-scraper.datasette.io/en/stable/

A command-line utility for taking automated screenshots of websites
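
A small sketch of driving the tool from a script, assuming `shot-scraper` is installed and on the PATH. The helper names `build_command` and `take_screenshot` are hypothetical; the URL, output file, and viewport size are placeholders:

```python
import shutil
import subprocess


def build_command(url, output, width=1280, height=800):
    """Build the shot-scraper argument list for a single screenshot."""
    return [
        "shot-scraper", url,
        "-o", output,
        "--width", str(width),
        "--height", str(height),
    ]


def take_screenshot(url, output):
    """Run shot-scraper, failing early if the command is not available."""
    if shutil.which("shot-scraper") is None:
        raise RuntimeError("shot-scraper is not on PATH")
    subprocess.run(build_command(url, output), check=True)


if __name__ == "__main__":
    print(" ".join(build_command("https://example.com/", "example.png")))
```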

#7:30 AM

scraping

Tuesday, November 4, 2025

Common Crawl Is Doing the AI Industry’s Dirty Work - The Atlantic

www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567/?gift=iWa_iB9lkw4UuiWbIbrWGQv84IP0_-K67yuVC013Fx4

Sounds like news websites need to hire a proper engineer. This isn’t Common Crawl’s problem to solve:

Common Crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles.
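
To make the mechanism concrete, here is a minimal sketch of a client that reads HTML without running any JavaScript. The `PAGE` markup and the `ArticleText` parser are made up for illustration and are not Common Crawl's code. Because the script is never executed, the article text is still sitting in the response:

```python
from html.parser import HTMLParser

# Hypothetical page: the full article ships in the HTML, and a script
# removes it in the browser for non-subscribers.
PAGE = """
<html><body>
<article><p>Full article text.</p></article>
<script>
  document.querySelector("article").remove();
</script>
</body></html>
"""


class ArticleText(HTMLParser):
    """Collect text inside <article>; script contents are never executed."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False

    def handle_data(self, data):
        if self.in_article and data.strip():
            self.chunks.append(data.strip())


def extract_article(html):
    parser = ArticleText()
    parser.feed(html)
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(extract_article(PAGE))
```

The fix on the publisher side is to enforce the paywall on the server, so the full text is never sent to clients that are not entitled to it.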

#7:57 PM

ai scraping