Automating a headless browser is preferred over parsing the downloaded HTML because the browser can execute JavaScript, so content rendered on the client side is available to the scraper
Web crawler - Wikiwand
Web scraping - Wikiwand
lorien/awesome-web-scraping: List of libraries, tools and APIs for web scraping and data processing.
BruceDone/awesome-crawler: A collection of awesome web crawler,spider in different languages
THE SCRAPINGHUB BLOG
A guide to Web Scraping without getting blocked · daolf
Introduction to web scraping
Web scraping for beginners | Apify Documentation
Big Data: What is Web Scraping and how to use it | IT Svit Blog
Big Data Scraping vs Web Data Crawling | IT Svit Blog
Turn Websites into structured data /Dataflow kit
Knowledge Graph, AI Web Data Extraction and Crawling | Diffbot
CSS Selector Capture Pro
SelectorGadget
⚙️ Explain Selenium & Webdrivers automation (Like I'm Five) - DEV Community 👩💻👨💻
Google Open Source Blog: Introducing WebDriver
WebdriverIO · Next-gen WebDriver test framework for Node.js
Browserbase: Headless browsers for AI agents & applications Browser scraping as a service
task runner for browser tests:
testem/testem: Test'em 'Scripts! A test runner that makes Javascript unit testing fun.
substack/testling: unit tests in all the browsers
Introducing fuite: a tool for finding memory leaks in web apps | Read the Tea Leaves
nolanlawson/fuite: A tool for finding memory leaks in web apps
Selenium
Selenium - Web Browser Automation
Selenium, Travis-CI and WebRTC == <&
Using Python with Selenium to Automate Mouse Clicks and Filling Forms
Advanced Automation Tips with Python | Selenium - DEV Community 👩💻👨💻
How I DIY’d my Budget Using Python for Selenium and Beautiful Soup | by Jennifer Kim | Towards Data Science
Learn How to Automate Browser Testing With Selenium WebDriver — Part 1 - DZone DevOps
Automate 99% of Websites with Selenium 4 and Python | by Frank Andrade | Geek Culture | May, 2022 | Medium
Sahi (software) - Wikiwand
Sahi - Web Automation and Test Tool download | SourceForge.net
5 Best Python Frameworks for WebView Testing | Codementor
Robot Framework
Robot Framework documentation
Robot Framework Introduction
QuickStartGuide/QuickStart.rst at master · robotframework/QuickStartGuide
Crawlee
builds on top of Puppeteer and Playwright
Crawlee · Build reliable crawlers. Fast.
apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Headless Chrome
Learn Playwright & Puppeteer | Checkly originally theheadless.dev
checkly/theheadless.dev: 🪖 Learn Puppeteer and Playwright - Tips, tricks and in-depth guides from the trenches. 😴inactive
Getting Started with Headless Chrome - Chrome Developers
Chrome’s Headless mode gets an upgrade: introducing --headless=new - Chrome Developers ❗!important
Headless Chromium
yujiosaka/headless-chrome-crawler: Distributed crawler powered by Headless Chrome
Puppeteer
v23 added Firefox support
Puppeteer | Puppeteer
API Reference | Puppeteer
Puppeteer Guides | Puppeteer
Puppeteer - Chrome for Developers
Bun has problems running the post-install script; install the browser manually:
bunx @puppeteer/browsers install chrome@stable --path $HOME/.cache/puppeteer
puppeteer/puppeteer: Node.js API for Chrome
Puppeteer Tutorial
Introduction to Puppeteer
Automating Google Chrome with Node.js - Tutorialzine with Puppeteer
The Guide To Ethical Scraping Of Dynamic Websites With Node.js And Puppeteer — Smashing Magazine
Tutorial: User Interface Testing with Jest and Puppeteer
Browser automation revisited - meet Puppeteer | Gergely Nemeth slowMo: 250 (ms), profiling, intercepting requests
Web Scraping in JavaScript – How to Use Puppeteer to Scrape Web Pages
Puppeteer – Headless Chrome in a Container – zwischenzugs
How to set up a Headless Chrome Node.js server in Docker - LogRocket Blog
Koan
Puppeteer quick start - Chrome Developers
checkly/puppeteer-examples: Puppeteer example scripts for running Headless Chrome from Node. 😴inactive
Logging into a website | Apify Documentation
// `page.evaluate()` serializes the return value
// use this trick to return a non-serializable object
// https://github.com/puppeteer/puppeteer/issues/3986
page.evaluate(await page.evaluate(() => window.toString()));
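A hedged alternative, not necessarily the trick from the linked issue: `page.evaluateHandle()` returns a handle to the in-page object instead of a serialized copy, and the handle can then be asked for only its serializable fields. The helper name `readWindowInfo` and the chosen fields are illustrative assumptions; `page` is assumed to be an already opened Puppeteer page.

```javascript
// Sketch: keep a non-serializable object (here `window`) inside the page
// and pull out only plain, serializable fields from it.
async function readWindowInfo(page) {
  // evaluateHandle() resolves to a JSHandle rather than a serialized value
  const windowHandle = await page.evaluateHandle(() => window);
  try {
    // the callback runs in the page with the real object as its argument
    return await windowHandle.evaluate((w) => ({
      href: w.location.href,
      innerWidth: w.innerWidth,
    }));
  } finally {
    // release the handle so the page can garbage-collect the object
    await windowHandle.dispose();
  }
}
```

Usage would be along the lines of `const info = await readWindowInfo(page);` after `page.goto()`.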
// get all links
const hrefs = await page.$$eval("a", (as) => as.map((a) => a.href));
console.log(hrefs);
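A typical next step in a crawler is to narrow the collected `hrefs` to links worth visiting. The sketch below keeps unique same-origin links with fragments stripped; the function name `sameOriginLinks` and the filtering policy are illustrative assumptions, not part of Puppeteer.

```javascript
// Sketch: filter the array returned by page.$$eval("a", ...) down to
// unique links on the same origin as the page being crawled.
function sameOriginLinks(hrefs, pageUrl) {
  const origin = new URL(pageUrl).origin;
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue; // anchors without an href yield ""
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative links too
    } catch {
      continue; // skip values that are not valid URLs
    }
    if (url.origin !== origin) continue; // drops other sites, mailto:, javascript:
    url.hash = ""; // "#section" variants point at the same document
    seen.add(url.href);
  }
  return [...seen];
}
```

Usage would be along the lines of `sameOriginLinks(hrefs, page.url())`.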
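// get property of element
const video = await page.waitForSelector("video.player");
// getProperty() resolves to a JSHandle; jsonValue() unwraps it to a plain value
const src = await (await video.getProperty("src")).jsonValue();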
// slow down browser operations
import puppeteer from "puppeteer";
const browser = await puppeteer.launch({
  headless: false, // show the browser window
  slowMo: 250, // delay each operation by 250 ms
});
const page = await browser.newPage();
await page.setRequestInterception(true);
// intercept requests
// install this before `page.goto()`
page.on("request", (request) => {
  // url is a method; abort() takes an optional error-code string, not an HTTP status
  if (request.url().includes(".png")) {
    request.abort();
  } else {
    request.continue();
  }
});
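Matching on the URL misses images served without a `.png` suffix. A hedged variant decides by `request.resourceType()` instead; the `BLOCKED_TYPES` set and both function names are illustrative assumptions, and `page` is assumed to be a Puppeteer page on which interception has not been enabled yet.

```javascript
// Resource types to drop (an assumed policy, not a Puppeteer default).
const BLOCKED_TYPES = new Set(["image", "media", "font"]);

// Pure decision function, kept separate so it is easy to test.
function shouldBlock(request) {
  return BLOCKED_TYPES.has(request.resourceType());
}

// Call this before page.goto() so the first requests are intercepted too.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (shouldBlock(request)) {
      request.abort(); // without an argument the error code is "failed"
    } else {
      request.continue();
    }
  });
}
```

Keeping the decision in `shouldBlock` means the policy can be changed or tested without launching a browser.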
// log network requests
// install this before `page.goto()`
page.on("response", (response) => {
  // allow XHR only
  // if ("xhr" !== response.request().resourceType()) {
  //   return;
  // }
  console.log(`[${response.request().resourceType()}] ${response.url()}`);
});
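When the goal is to find the API calls behind a page rather than to read a console log, the same listener can collect entries into an array. This is a minimal sketch: the function name `recordApiResponses` and the shape of the recorded entries are assumptions, and `page` is assumed to be a Puppeteer page that has not navigated yet.

```javascript
// Sketch: record XHR/fetch responses in an array instead of printing them.
function recordApiResponses(page, log = []) {
  page.on("response", (response) => {
    const type = response.request().resourceType();
    if (type !== "xhr" && type !== "fetch") return; // ignore documents, images, ...
    log.push({ type, status: response.status(), url: response.url() });
  });
  return log; // filled in as responses arrive
}
```

Usage would be along the lines of `const apiCalls = recordApiResponses(page);` before `page.goto()`, then inspecting `apiCalls` once the page has loaded.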
Nightwatch.js
Nightwatch.js | Node.js powered End-to-End testing framework
nightwatchjs/nightwatch: End-to-end testing framework written in Node.js and using the W3C Webdriver API
Playwright
targets all the popular rendering engines
bindings for Node.js, Python, Java, .NET
Fast and reliable end-to-end testing for modern web apps | Playwright
microsoft/playwright: Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
microsoft/playwright-dotnet: .NET version of the Playwright testing and automation library.
nearform/playwright-setup: Barebones playwright testing framework
Mastering Web Scraping in Python: Avoid Blocking Like a Ninja - ZenRows
Crawlee · Build reliable crawlers. Fast. | Crawlee
apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
Python
Image Scraping with Python - Towards Data Science
reanalytics-databoutique/webscraping-open-project: Repository of open knowledge about web scraping in Python
s0md3v/Photon: Incredibly fast crawler designed for OSINT.
binux/pyspider: A Powerful Spider(Web Crawler) System in Python.
Introduction - pyspider
MechanicalSoup/MechanicalSoup: A Python library for automating interaction with websites.
Welcome to MechanicalSoup’s documentation!
Requests-HTML: HTML Parsing for Humans (writing Python 3)! — requests-HTML documentation
Web Scraping 101 with Python
A beginner's guide to web scraping with Python | Opensource.com
Practical Introduction to Web Scraping in Python – Real Python
An Intro to Web Scraping With lxml and Python – Python Tips
Web Scraping with Python: A Comprehensive Tutorial requests+BeautifulSoup
Download Course Materials with A Simple Python Crawler XPath
Mastering Web Scraping in Python: From Zero to Hero - ZenRows
Web scraping and parsing with Beautiful Soup & Python - YouTube
Newspaper3k: Article scraping & curation — newspaper 0.0.2 documentation
codelucas/newspaper: News, full-text, and article metadata extraction in Python 3. Advanced docs:
michaelhelmick/lassie: Web Content Retrieval for Humans™
qinxuye/cola: A high-level distributed crawling framework.
Scrapy
Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
Scrapy documentation
Scrapy Cloud Scraping as a Service, with 7 day data retention
The Scrapinghub Blog – Turn Web Content Into Useful Data
Scraping the Steam Game Store with Scrapy – The Scrapinghub Blog
How to Build your own Price Monitoring Tool – The Scrapinghub Blog
woob
woob - Web Outside of Browsers
JavaScript
How to perform web-scraping using Node.js – Bits and Pieces
How to Perform Web-Scraping using Node.js- Part 2 – Bits and Pieces
ChukwuEmekaAjah/beautiful-dom: A JavaScript library that models essential HTML DOM API methods and properties relevant for extracting data from crawled web pages or XML documents
Beautiful-dom; a HTML parser built with TypeScript - DEV Community 👩💻👨💻
postlight/parser: 📜 Extract meaningful content from the chaos of a web page
cheeriojs/cheerio: Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
cheeriojs/cheerio-select: CSS selector engine supporting jQuery selectors, based on css-select
Go
antchfx/antch: Antch, a fast, powerful and extensible web crawling & scraping framework for Go 😴inactive
antchfx/antch-getstarted
Web Scraping with Go | DevDungeon
PuerkitoBio/goquery: A little like that j-thing, only in Go.
antchfx/htmlquery: htmlquery is golang XPath package for HTML query.
antchfx/xpath: XPath package for Golang, supports HTML, XML, JSON document query.
bitfield/weaver: A simple link checker in Go rate limiting
PHP
PHP: DOMDocument::loadHTML - Manual
How to Parse HTML using PHP Native Classes
PHP Scraper - An opinionated web-scraping library for PHP
Ruby
Capybara source Ruby, multiple drivers
Capybara and Selenium for Testing and Scraping - via @codeship