leesei's Second Brain 🧠

Browser automation

October 15, 2023

November 20, 2017

Automating the browser for web scraping and testing

Web crawler - Wikiwand
Web scraping - Wikiwand

lorien/awesome-web-scraping: List of libraries, tools and APIs for web scraping and data processing.
BruceDone/awesome-crawler: A collection of awesome web crawler,spider in different languages

THE SCRAPINGHUB BLOG
A guide to Web Scraping without getting blocked · daolf
A guide to Web Scraping without getting blocked
Introduction to web scraping
Web scraping for beginners | Apify Documentation

Big Data: What is Web Scraping and how to use it | IT Svit Blog
Big Data Scraping vs Web Data Crawling | IT Svit Blog

Turn Websites into structured data /Dataflow kit
Knowledge Graph, AI Web Data Extraction and Crawling | Diffbot

CSS Selector Capture Pro - Chrome Web Store
SelectorGadget - Chrome Web Store

⚙️ Explain Selenium & Webdrivers automation (Like I'm Five) - DEV Community 👩‍💻👨‍💻
Google Open Source Blog: Introducing WebDriver
WebdriverIO · Next-gen WebDriver test framework for Node.js

Task runners for browser tests:
testem/testem: Test'em 'Scripts! A test runner that makes Javascript unit testing fun.
substack/testling: unit tests in all the browsers

Introducing fuite: a tool for finding memory leaks in web apps | Read the Tea Leaves
nolanlawson/fuite: A tool for finding memory leaks in web apps

Selenium

Selenium - Web Browser Automation
Selenium, Travis-CI and WebRTC == <&
Using Python with Selenium to Automate Mouse Clicks and Filling Forms
Advanced Automation Tips with Python | Selenium - DEV Community 👩‍💻👨‍💻
How I DIY’d my Budget Using Python for Selenium and Beautiful Soup | by Jennifer Kim | Towards Data Science
Learn How to Automate Browser Testing With Selenium WebDriver — Part 1 - DZone DevOps
Automate 99% of Websites with Selenium 4 and Python | by Frank Andrade | Geek Culture | May, 2022 | Medium

Sahi (software) - Wikiwand
Sahi - Web Automation and Test Tool download | SourceForge.net

5 Best Python Frameworks for WebView Testing | Codementor

Robot Framework
Robot Framework documentation
Robot Framework Introduction
QuickStartGuide/QuickStart.rst at master · robotframework/QuickStartGuide

Nightwatch.js

Nightwatch.js | Node.js powered End-to-End testing framework
nightwatchjs/nightwatch: End-to-end testing framework written in Node.js and using the W3C Webdriver API

Headless Chrome

Learn Playwright & Puppeteer | Checkly originally theheadless.dev
checkly/theheadless.dev: 🪖 Learn Puppeteer and Playwright - Tips, tricks and in-depth guides from the trenches. 😴inactive

Getting Started with Headless Chrome - Chrome Developers
Chrome’s Headless mode gets an upgrade: introducing --headless=new - Chrome Developers ❗!important
Headless Chromium

yujiosaka/headless-chrome-crawler: Distributed crawler powered by Headless Chrome

Puppeteer

Puppeteer | Puppeteer
API Reference | Puppeteer
Puppeteer Guides | Puppeteer
Puppeteer - Chrome for Developers

puppeteer/puppeteer: Node.js API for Chrome

Puppeteer Tutorial
Introduction to Puppeteer
Automating Google Chrome with Node.js - Tutorialzine with Puppeteer
The Guide To Ethical Scraping Of Dynamic Websites With Node.js And Puppeteer — Smashing Magazine
Tutorial: User Interface Testing with Jest and Puppeteer
Browser automation revisited - meet Puppeteer | Gergely Nemeth slowMo: 250, //ms, profiling, intercepting requests
Web Scraping in JavaScript – How to Use Puppeteer to Scrape Web Pages

Puppeteer – Headless Chrome in a Container – zwischenzugs
How to set up a Headless Chrome Node.js server in Docker - LogRocket Blog

Koan

Puppeteer quick start - Chrome Developers

checkly/puppeteer-examples: Puppeteer example scripts for running Headless Chrome from Node. 😴inactive

Logging into a website | Apify Documentation

📷 How to take a screenshot of a webpage with JavaScript in Node.js (using puppeteer) - DEV Community 👩‍💻👨‍💻

// `page.evaluate()` serializes the return value, so a non-serializable
// object (e.g. `window`) may come back as `undefined`
// use `page.evaluateHandle()` to get a handle to such an object instead
// https://github.com/puppeteer/puppeteer/issues/3986
const windowHandle = await page.evaluateHandle(() => window);

// get all links
const hrefs = await page.$$eval("a", (as) => as.map((a) => a.href));
console.log(hrefs);

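// get property of element
const video = await page.waitForSelector("video.player");
// `getProperty()` returns a JSHandle; `jsonValue()` resolves it to the actual value
const srcHandle = await video.getProperty("src");
const src = await srcHandle.jsonValue();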

// slow down browser operations
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 250, //ms
});

const page = await browser.newPage();
await page.setRequestInterception(true);
// intercept requests
// install this before `page.goto()`
page.on("request", (request) => {
  // `url` is a method; `abort()` takes an optional error-code string, not an HTTP status
  if (request.url().includes(".png")) {
    request.abort();
  } else {
    request.continue();
  }
});
// log network requests
// install this before `page.goto()`
page.on("response", (response) => {
  // allow XHR only
  // if ("xhr" !== response.request().resourceType()) {
  //   return;
  // }
  console.log(`[${response.request().resourceType()}] ${response.url()}`);
});

Playwright

Targets all the popular rendering engines (Chromium, Firefox, WebKit)
Bindings for Node.js, Python, Java, and .NET

Fast and reliable end-to-end testing for modern web apps | Playwright

microsoft/playwright: Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
microsoft/playwright-dotnet: .NET version of the Playwright testing and automation library.

nearform/playwright-setup: Barebones playwright testing framework

Mastering Web Scraping in Python: Avoid Blocking Like a Ninja - ZenRows

Crawlee · Build reliable crawlers. Fast. | Crawlee
apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.

Python

Image Scraping with Python - Towards Data Science
reanalytics-databoutique/webscraping-open-project: Repository of open knowledge about web scraping in Python
s0md3v/Photon: Incredibly fast crawler designed for OSINT.

binux/pyspider: A Powerful Spider(Web Crawler) System in Python.
Introduction - pyspider

MechanicalSoup/MechanicalSoup: A Python library for automating interaction with websites.
Welcome to MechanicalSoup’s documentation!

Requests-HTML: HTML Parsing for Humans (writing Python 3)! — requests-HTML documentation

Web Scraping 101 with Python
A beginner's guide to web scraping with Python | Opensource.com

Practical Introduction to Web Scraping in Python – Real Python
An Intro to Web Scraping With lxml and Python – Python Tips
Web Scraping with Python: A Comprehensive Tutorial requests+BeautifulSoup

Download Course Materials with A Simple Python Crawler XPath

Mastering Web Scraping in Python: From Zero to Hero - ZenRows
Web scraping and parsing with Beautiful Soup & Python - YouTube

Newspaper3k: Article scraping & curation — newspaper 0.0.2 documentation
codelucas/newspaper: News, full-text, and article metadata extraction in Python 3. Advanced docs:

michaelhelmick/lassie: Web Content Retrieval for Humans™
qinxuye/cola: A high-level distributed crawling framework.

Scrapy

Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
Scrapy documentation
Scrapy Cloud Scraping as a Service, with 7 day data retention

The Scrapinghub Blog – Turn Web Content Into Useful Data
Scraping the Steam Game Store with Scrapy – The Scrapinghub Blog
How to Build your own Price Monitoring Tool – The Scrapinghub Blog

woob

woob - Web Outside of Browsers

JavaScript

How to perform web-scraping using Node.js – Bits and Pieces
How to Perform Web-Scraping using Node.js- Part 2 – Bits and Pieces

ChukwuEmekaAjah/beautiful-dom: A JavaScript library that models essential HTML DOM API methods and properties relevant for extracting data from crawled web pages or XML documents
Beautiful-dom; a HTML parser built with TypeScript - DEV Community 👩‍💻👨‍💻

postlight/parser: 📜 Extract meaningful content from the chaos of a web page

cheeriojs/cheerio: Fast, flexible, and lean implementation of core jQuery designed specifically for the server.

Others

Web Scraping with Go | DevDungeon
PuerkitoBio/goquery: A little like that j-thing, only in Go.

segmentio/nightmare: A high-level browser automation library. uses Electron (Chromium inside)
CasperJS, a navigation scripting and testing utility for PhantomJS and SlimerJS Webkit/Gecko

Capybara source Ruby, multiple drivers
Capybara and Selenium for Testing and Scraping - via @codeship