Browser automation

Automating a headless browser is preferred over parsing the downloaded HTML because the browser can execute JavaScript, so content rendered on the client side is available too.
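
A minimal sketch of the idea, assuming a Puppeteer `Browser` object is passed in (for example from `puppeteer.launch()`). The helper name `fetchRenderedHtml` is made up for illustration; `newPage()`, `goto()`, `content()` and `close()` are Puppeteer calls.

```javascript
// Hypothetical helper: return a page's HTML after its scripts have run.
// `browser` is assumed to be a Puppeteer Browser instance.
async function fetchRenderedHtml(browser, url) {
  const page = await browser.newPage();
  try {
    // "networkidle0": consider navigation done once the network has gone quiet
    await page.goto(url, { waitUntil: "networkidle0" });
    // serialized DOM, including nodes that scripts added after load
    return await page.content();
  } finally {
    // always release the tab, even if navigation failed
    await page.close();
  }
}

// Usage (requires the `puppeteer` package):
//   const browser = await puppeteer.launch();
//   const html = await fetchRenderedHtml(browser, "https://example.com/");
//   await browser.close();
```

The returned string can then be handed to any HTML parser, which is how a plain HTTP download and a rendered page can end up with different content.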

Web crawler - Wikiwand
Web scraping - Wikiwand

lorien/awesome-web-scraping: List of libraries, tools and APIs for web scraping and data processing.
BruceDone/awesome-crawler: A collection of awesome web crawler,spider in different languages

THE SCRAPINGHUB BLOG
A guide to Web Scraping without getting blocked · daolf
A guide to Web Scraping without getting blocked
Introduction to web scraping
Web scraping for beginners | Apify Documentation

Big Data: What is Web Scraping and how to use it | IT Svit Blog
Big Data Scraping vs Web Data Crawling | IT Svit Blog

Turn Websites into structured data /Dataflow kit
Knowledge Graph, AI Web Data Extraction and Crawling | Diffbot

CSS Selector Capture Pro
SelectorGadget

⚙️ Explain Selenium & Webdrivers automation (Like I'm Five) - DEV Community 👩‍💻👨‍💻
Google Open Source Blog: Introducing WebDriver
WebdriverIO · Next-gen WebDriver test framework for Node.js

Browserbase: Headless browsers for AI agents & applications Browser scraping as a service

task runner for browser tests:
testem/testem: Test'em 'Scripts! A test runner that makes Javascript unit testing fun.
substack/testling: unit tests in all the browsers

Introducing fuite: a tool for finding memory leaks in web apps | Read the Tea Leaves
nolanlawson/fuite: A tool for finding memory leaks in web apps

Bot Detection

window['navigator']['webdriver'] returns true in most automated browsers (headless or not), so sites use it to detect bots.
Inject an init script that runs before the page's own scripts to circumvent this: delete Object.getPrototypeOf(navigator).webdriver
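
A sketch of how that init script could be installed with Puppeteer, assuming `page` is a Puppeteer `Page`. The function names are made up for illustration; `page.evaluateOnNewDocument()` is the Puppeteer call that runs a function in every new document before the page's scripts. After the patch the property reads as `undefined` rather than `false`, and sites can check other signals, so this is a partial measure at best.

```javascript
// Runs inside the page. `nav` defaults to the page's own `navigator`;
// the parameter exists only so the function can be exercised outside a browser.
function hideWebdriverFlag(nav = navigator) {
  // `webdriver` is defined on the prototype, not on the instance
  delete Object.getPrototypeOf(nav).webdriver;
}

// Hypothetical helper: `page` is assumed to be a Puppeteer Page.
// Call it before `page.goto()` so the patch is in place for the first document.
async function installWebdriverPatch(page) {
  await page.evaluateOnNewDocument(hideWebdriverFlag);
}
```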

Antibot
java - Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection - Stack Overflow

Selenium

Selenium - Web Browser Automation
Selenium, Travis-CI and WebRTC == <&
Using Python with Selenium to Automate Mouse Clicks and Filling Forms
Advanced Automation Tips with Python | Selenium - DEV Community 👩‍💻👨‍💻
How I DIY’d my Budget Using Python for Selenium and Beautiful Soup | by Jennifer Kim | Towards Data Science
Learn How to Automate Browser Testing With Selenium WebDriver — Part 1 - DZone DevOps
Automate 99% of Websites with Selenium 4 and Python | by Frank Andrade | Geek Culture | May, 2022 | Medium

Sahi (software) - Wikiwand
Sahi - Web Automation and Test Tool download | SourceForge.net

5 Best Python Frameworks for WebView Testing | Codementor

Robot Framework
Robot Framework documentation
Robot Framework Introduction
QuickStartGuide/QuickStart.rst at master · robotframework/QuickStartGuide

Crawlee

builds on top of Puppeteer and Playwright

Crawlee · Build reliable crawlers. Fast.
apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Headless Chrome

Learn How to Test, Monitor, and Automate with Playwright originally theheadless.dev
checkly/theheadless.dev: 🪖 Learn Puppeteer and Playwright - Tips, tricks and in-depth guides from the trenches. 🗃️archived, old site

Getting Started with Headless Chrome - Chrome Developers
Chrome’s Headless mode gets an upgrade: introducing --headless=new - Chrome Developers ❗!important
Headless Chromium

yujiosaka/headless-chrome-crawler: Distributed crawler powered by Headless Chrome

Puppeteer

v23 added Firefox support

Puppeteer | Puppeteer
API Reference | Puppeteer
Puppeteer Guides | Puppeteer
Puppeteer - Chrome for Developers

Bun has problems running Puppeteer's post-install script, so install the browser manually:

bunx @puppeteer/browsers install chrome@stable --path $HOME/.cache/puppeteer

puppeteer/puppeteer: Node.js API for Chrome

Puppeteer Tutorial
Introduction to Puppeteer
Automating Google Chrome with Node.js - Tutorialzine with Puppeteer
The Guide To Ethical Scraping Of Dynamic Websites With Node.js And Puppeteer — Smashing Magazine
Tutorial: User Interface Testing with Jest and Puppeteer
Browser automation revisited - meet Puppeteer | Gergely Nemeth (covers slowMo: 250, //ms, profiling, intercepting requests)
Web Scraping in JavaScript – How to Use Puppeteer to Scrape Web Pages

Puppeteer – Headless Chrome in a Container – zwischenzugs
How to set up a Headless Chrome Node.js server in Docker - LogRocket Blog

Koan

Puppeteer quick start - Chrome Developers

checkly/puppeteer-examples: Puppeteer example scripts for running Headless Chrome from Node. 😴inactive

Logging into a website | Apify Documentation

📷 How to take a screenshot of a webpage with JavaScript in Node.js (using puppeteer) - DEV Community 👩‍💻👨‍💻

// `page.evaluate()` serializes the return value
// use `page.evaluateHandle()` to get a handle to a non-serializable object
// https://github.com/puppeteer/puppeteer/issues/3986
const windowHandle = await page.evaluateHandle(() => window);

// get all links
const hrefs = await page.$$eval("a", (as) => as.map((a) => a.href));
console.log(hrefs);

// get property of element
const video = await page.waitForSelector("video.player");
// `getProperty()` resolves to a JSHandle; `jsonValue()` unwraps it
const src = await (await video.getProperty("src")).jsonValue();

// slow down browser operations
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 250, //ms
});

const page = await browser.newPage();
await page.setRequestInterception(true);
// intercept requests
// install this before `page.goto()`
page.on("request", (request) => {
  // `url()` is a method; block PNG images
  if (request.url().includes(".png")) {
    // `abort()` takes an optional error code string, not an HTTP status
    request.abort();
    return;
  }
  request.continue();
});
// log network requests
// install this before `page.goto()`
page.on("response", (response) => {
  // allow XHR only
  // if ("xhr" !== response.request().resourceType()) {
  //   return;
  // }
  console.log(`[${response.request().resourceType()}] ${response.url()}`);
});

Nightwatch.js

Nightwatch.js | Node.js powered End-to-End testing framework
nightwatchjs/nightwatch: End-to-end testing framework written in Node.js and using the W3C Webdriver API

Playwright

targets all the popular rendering engines (Chromium, Firefox, WebKit)
bindings for Node.js, Python, Java, .NET

Fast and reliable end-to-end testing for modern web apps | Playwright
Fast and reliable end-to-end testing for modern web apps | Playwright Python

microsoft/playwright: Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API. Javascript
microsoft/playwright-python: Python version of the Playwright testing and automation library. Python

Playwright - YouTube

Mastering Web Scraping in Python: Avoid Blocking Like a Ninja - ZenRows

Crawlee · Build reliable crawlers. Fast. | Crawlee
apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.

Request interception with Puppeteer and Playwright - DEV Community

Codegen

Test generator | Playwright

Koan

microsoft/playwright-examples

// evaluate all available links
const as = await page.locator("a").all();
// wait for all the getAttribute() calls to resolve
const links = await Promise.all(
  // extract the `href` attribute
  as.map((a) => a.getAttribute("href")),
);
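
`getAttribute("href")` returns the attribute as written in the markup, so the values can be relative paths or `null`. A small sketch for normalizing them, assuming the page URL is available (for example from `page.url()`); the helper name `toAbsoluteLinks` is made up for illustration.

```javascript
// Hypothetical helper: resolve raw `href` values against the page URL,
// dropping missing or unparsable entries and removing duplicates.
function toAbsoluteLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue; // <a> without an href attribute
    try {
      seen.add(new URL(href, baseUrl).href);
    } catch {
      // skip values the URL parser rejects
    }
  }
  return [...seen];
}

// Usage with the snippet above:
//   const absolute = toAbsoluteLinks(links, page.url());
```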