
Browser automation

January 9, 2025
November 20, 2017

Automates a browser for web scraping and testing

Automating a headless browser is preferred over parsing the downloaded HTML, because the browser can execute JavaScript
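A minimal sketch of why the downloaded HTML alone can be misleading. The page markup, the `price` id, and both helper functions are made up for illustration, and the fake `document` run through Node's built-in `vm` module is only a toy stand-in for a real browser, not how Puppeteer or Playwright work internally.

```javascript
const vm = require("node:vm");

// Hypothetical page: the price is filled in by a script, not present in the markup.
const html = `
<div id="price"></div>
<script>document.getElementById("price").textContent = "42";</script>
`;

// 1. Parse the downloaded HTML as-is: the element is still empty.
function staticText(source, id) {
  const match = source.match(new RegExp(`<div id="${id}">([^<]*)</div>`));
  return match ? match[1] : null;
}

// 2. Crude stand-in for a browser: run the inline scripts against a fake `document`.
function renderedText(source, id) {
  const elements = {};
  const document = {
    // create the element on first access so scripts can write to it
    getElementById: (key) => (elements[key] ??= { textContent: "" }),
  };
  for (const [, code] of source.matchAll(/<script>([\s\S]*?)<\/script>/g)) {
    vm.runInNewContext(code, { document });
  }
  return document.getElementById(id).textContent;
}

console.log(JSON.stringify(staticText(html, "price")));   // value seen by a plain HTML parser
console.log(JSON.stringify(renderedText(html, "price"))); // value after the script has run
```

A real headless browser does the second step for every script, stylesheet, and network request on the page, which is why it sees content that a plain HTML parser misses.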

Web crawler - Wikiwand
Web scraping - Wikiwand

lorien/awesome-web-scraping: List of libraries, tools and APIs for web scraping and data processing.
BruceDone/awesome-crawler: A collection of awesome web crawler,spider in different languages

THE SCRAPINGHUB BLOG
A guide to Web Scraping without getting blocked · daolf
A guide to Web Scraping without getting blocked
Introduction to web scraping
Web scraping for beginners | Apify Documentation

Big Data: What is Web Scraping and how to use it | IT Svit Blog
Big Data Scraping vs Web Data Crawling | IT Svit Blog

Turn Websites into structured data /Dataflow kit
Knowledge Graph, AI Web Data Extraction and Crawling | Diffbot

CSS Selector Capture Pro
SelectorGadget

⚙️ Explain Selenium & Webdrivers automation (Like I'm Five) - DEV Community 👩‍💻👨‍💻
Google Open Source Blog: Introducing WebDriver
WebdriverIO · Next-gen WebDriver test framework for Node.js

Browserbase: Headless browsers for AI agents & applications Browser scraping as a service

Task runners for browser tests:
testem/testem: Test'em 'Scripts! A test runner that makes Javascript unit testing fun.
substack/testling: unit tests in all the browsers

Introducing fuite: a tool for finding memory leaks in web apps | Read the Tea Leaves
nolanlawson/fuite: A tool for finding memory leaks in web apps

Selenium

Selenium - Web Browser Automation
Selenium, Travis-CI and WebRTC == <&
Using Python with Selenium to Automate Mouse Clicks and Filling Forms
Advanced Automation Tips with Python | Selenium - DEV Community 👩‍💻👨‍💻
How I DIY’d my Budget Using Python for Selenium and Beautiful Soup | by Jennifer Kim | Towards Data Science
Learn How to Automate Browser Testing With Selenium WebDriver — Part 1 - DZone DevOps
Automate 99% of Websites with Selenium 4 and Python | by Frank Andrade | Geek Culture | May, 2022 | Medium

Sahi (software) - Wikiwand
Sahi - Web Automation and Test Tool download | SourceForge.net

5 Best Python Frameworks for WebView Testing | Codementor

Robot Framework
Robot Framework documentation
Robot Framework Introduction
QuickStartGuide/QuickStart.rst at master · robotframework/QuickStartGuide

Crawlee

builds on top of Puppeteer and Playwright

Crawlee · Build reliable crawlers. Fast.
apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Headless Chrome

Learn Playwright & Puppeteer | Checkly originally theheadless.dev
checkly/theheadless.dev: 🪖 Learn Puppeteer and Playwright - Tips, tricks and in-depth guides from the trenches. 😴inactive

Getting Started with Headless Chrome - Chrome Developers
Chrome’s Headless mode gets an upgrade: introducing --headless=new - Chrome Developers ❗!important
Headless Chromium

yujiosaka/headless-chrome-crawler: Distributed crawler powered by Headless Chrome

Puppeteer

v23 added Firefox support

Puppeteer | Puppeteer
API Reference | Puppeteer
Puppeteer Guides | Puppeteer
Puppeteer - Chrome for Developers

Bun has problems running the post-install script; install the browser manually:

bunx @puppeteer/browsers install chrome@stable --path $HOME/.cache/puppeteer

puppeteer/puppeteer: Node.js API for Chrome

Puppeteer Tutorial
Introduction to Puppeteer
Automating Google Chrome with Node.js - Tutorialzine with Puppeteer
The Guide To Ethical Scraping Of Dynamic Websites With Node.js And Puppeteer — Smashing Magazine
Tutorial: User Interface Testing with Jest and Puppeteer
Browser automation revisited - meet Puppeteer | Gergely Nemeth slowMo: 250, //ms, profiling, intercepting requests
Web Scraping in JavaScript – How to Use Puppeteer to Scrape Web Pages

Puppeteer – Headless Chrome in a Container – zwischenzugs
How to set up a Headless Chrome Node.js server in Docker - LogRocket Blog

Koan

Puppeteer quick start - Chrome Developers

checkly/puppeteer-examples: Puppeteer example scripts for running Headless Chrome from Node. 😴inactive

Logging into a website | Apify Documentation

📷 How to take a screenshot of a webpage with JavaScript in Node.js (using puppeteer) - DEV Community 👩‍💻👨‍💻

// `page.evaluate()` serializes its return value
// use this trick to return a non-serializable object
// https://github.com/puppeteer/puppeteer/issues/3986
page.evaluate(await page.evaluate(() => window.toString()));

// get all links
const hrefs = await page.$$eval("a", (as) => as.map((a) => a.href));
console.log(hrefs);

// get property of element
const video = await page.waitForSelector("video.player");
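// `getProperty()` returns a JSHandle; `jsonValue()` unwraps it to a plain value
const src = await (await video.getProperty("src")).jsonValue();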

// slow down browser operations
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 250, //ms
});

const page = await browser.newPage();
await page.setRequestInterception(true);
// intercept requests
// install this before `page.goto()`
page.on("request", (request) => {
  // `url()` is a method; `abort()` takes an optional error-code string, not an HTTP status
  if (request.url().includes(".png")) {
    request.abort();
  } else {
    request.continue();
  }
});
// log network requests
// install this before `page.goto()`
page.on("response", (response) => {
  // allow XHR only
  // if ("xhr" !== response.request().resourceType()) {
  //   return;
  // }
  console.log(`[${response.request().resourceType()}] ${response.url()}`);
});
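The blocking decision in a request handler can be pulled out into a plain function so it can be checked without launching a browser. This is only a sketch: `shouldBlock`, `blockedTypes`, and `blockedExtensions` are hypothetical names rather than Puppeteer options, and the default lists are arbitrary examples.

```javascript
// Decide whether an intercepted request should be blocked.
// `resourceType` is expected to be a value such as "image", "xhr" or "document",
// as reported by `request.resourceType()`.
function shouldBlock(url, resourceType, {
  blockedTypes = ["image", "media", "font"],
  blockedExtensions = [".png", ".jpg", ".gif"],
} = {}) {
  if (blockedTypes.includes(resourceType)) return true;
  // compare against the path only, so query strings like `?v=2` do not hide the extension
  const { pathname } = new URL(url);
  return blockedExtensions.some((ext) => pathname.toLowerCase().endsWith(ext));
}

// Possible wiring (needs `await page.setRequestInterception(true)` first):
// page.on("request", (request) =>
//   shouldBlock(request.url(), request.resourceType())
//     ? request.abort()
//     : request.continue()
// );
```

Matching on the parsed path rather than the raw URL avoids blocking requests that merely contain `.png` somewhere in a query parameter.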

Nightwatch.js

Nightwatch.js | Node.js powered End-to-End testing framework
nightwatchjs/nightwatch: End-to-end testing framework written in Node.js and using the W3C Webdriver API

Playwright

targets all the popular rendering engines (Chromium, Firefox, WebKit)
bindings for Node.js, Python, Java, .NET

Fast and reliable end-to-end testing for modern web apps | Playwright

microsoft/playwright: Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
microsoft/playwright-dotnet: .NET version of the Playwright testing and automation library.

nearform/playwright-setup: Barebones playwright testing framework

Mastering Web Scraping in Python: Avoid Blocking Like a Ninja - ZenRows

Crawlee · Build reliable crawlers. Fast. | Crawlee
apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.

Python

Image Scraping with Python - Towards Data Science
reanalytics-databoutique/webscraping-open-project: Repository of open knowledge about web scraping in Python
s0md3v/Photon: Incredibly fast crawler designed for OSINT.

binux/pyspider: A Powerful Spider(Web Crawler) System in Python.
Introduction - pyspider

MechanicalSoup/MechanicalSoup: A Python library for automating interaction with websites.
Welcome to MechanicalSoup’s documentation!

Requests-HTML: HTML Parsing for Humans (writing Python 3)! — requests-HTML documentation

Web Scraping 101 with Python
A beginner's guide to web scraping with Python | Opensource.com

Practical Introduction to Web Scraping in Python – Real Python
An Intro to Web Scraping With lxml and Python – Python Tips
Web Scraping with Python: A Comprehensive Tutorial requests+BeautifulSoup

Download Course Materials with A Simple Python Crawler XPath

Mastering Web Scraping in Python: From Zero to Hero - ZenRows
Web scraping and parsing with Beautiful Soup & Python - YouTube

Newspaper3k: Article scraping & curation — newspaper 0.0.2 documentation
codelucas/newspaper: News, full-text, and article metadata extraction in Python 3. Advanced docs:

michaelhelmick/lassie: Web Content Retrieval for Humans™
qinxuye/cola: A high-level distributed crawling framework.

Scrapy

Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
Scrapy documentation
Scrapy Cloud Scraping as a Service, with 7 day data retention

The Scrapinghub Blog – Turn Web Content Into Useful Data
Scraping the Steam Game Store with Scrapy – The Scrapinghub Blog
How to Build your own Price Monitoring Tool – The Scrapinghub Blog

woob

woob - Web Outside of Browsers

JavaScript

How to perform web-scraping using Node.js – Bits and Pieces
How to Perform Web-Scraping using Node.js- Part 2 – Bits and Pieces

ChukwuEmekaAjah/beautiful-dom: A JavaScript library that models essential HTML DOM API methods and properties relevant for extracting data from crawled web pages or XML documents
Beautiful-dom; a HTML parser built with TypeScript - DEV Community 👩‍💻👨‍💻

postlight/parser: 📜 Extract meaningful content from the chaos of a web page

cheeriojs/cheerio: Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
cheeriojs/cheerio-select: CSS selector engine supporting jQuery selectors, based on css-select

Go

antchfx/antch: Antch, a fast, powerful and extensible web crawling & scraping framework for Go 😴inactive
antchfx/antch-getstarted

Web Scraping with Go | DevDungeon
PuerkitoBio/goquery: A little like that j-thing, only in Go.

tech-engine/goscrapy: GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.

antchfx/htmlquery: htmlquery is golang XPath package for HTML query.
antchfx/xpath: XPath package for Golang, supports HTML, XML, JSON document query.

bitfield/weaver: A simple link checker in Go rate limiting

PHP

PHP: DOMDocument::loadHTML - Manual
How to Parse HTML using PHP Native Classes

PHP Scraper - An opinionated web-scraping library for PHP

paquettg/php-html-parser: An HTML DOM parser. It allows you to manipulate HTML. Find tags on an HTML page with selectors just like jQuery.

PHP Simple HTML DOM Parser

Ruby

Capybara source Ruby, multiple drivers
Capybara and Selenium for Testing and Scraping - via @codeship