Web crawler - Wikiwand
Web scraping - Wikiwand
lorien/awesome-web-scraping: List of libraries, tools and APIs for web scraping and data processing.
BruceDone/awesome-crawler: A collection of awesome web crawler,spider in different languages
THE SCRAPINGHUB BLOG
A guide to Web Scraping without getting blocked · daolf
A guide to Web Scraping without getting blocked
Introduction to web scraping
Web scraping for beginners | Apify Documentation
Big Data: What is Web Scraping and how to use it | IT Svit Blog
Big Data Scraping vs Web Data Crawling | IT Svit Blog
Turn Websites into structured data /Dataflow kit
Knowledge Graph, AI Web Data Extraction and Crawling | Diffbot
CSS Selector Capture Pro - Chrome Web Store
SelectorGadget - Chrome Web Store
⚙️ Explain Selenium & Webdrivers automation (Like I'm Five) - DEV Community 👩💻👨💻
Google Open Source Blog: Introducing WebDriver
WebdriverIO · Next-gen WebDriver test framework for Node.js
task runner for browser tests:
testem/testem: Test'em 'Scripts! A test runner that makes Javascript unit testing fun.
substack/testling: unit tests in all the browsers
Introducing fuite: a tool for finding memory leaks in web apps | Read the Tea Leaves
nolanlawson/fuite: A tool for finding memory leaks in web apps
Selenium
Selenium - Web Browser Automation
Selenium, Travis-CI and WebRTC == <&
Using Python with Selenium to Automate Mouse Clicks and Filling Forms
Advanced Automation Tips with Python | Selenium - DEV Community 👩💻👨💻
How I DIY’d my Budget Using Python for Selenium and Beautiful Soup | by Jennifer Kim | Towards Data Science
Learn How to Automate Browser Testing With Selenium WebDriver — Part 1 - DZone DevOps
Automate 99% of Websites with Selenium 4 and Python | by Frank Andrade | Geek Culture | May, 2022 | Medium
Sahi (software) - Wikiwand
Sahi - Web Automation and Test Tool download | SourceForge.net
5 Best Python Frameworks for WebView Testing | Codementor
Robot Framework
Robot Framework documentation
Robot Framework Introduction
QuickStartGuide/QuickStart.rst at master · robotframework/QuickStartGuide
Nightwatch.js
Nightwatch.js | Node.js powered End-to-End testing framework
nightwatchjs/nightwatch: End-to-end testing framework written in Node.js and using the W3C Webdriver API
Headless Chrome
Learn Playwright & Puppeteer | Checkly originally theheadless.dev
checkly/theheadless.dev: 🪖 Learn Puppeteer and Playwright - Tips, tricks and in-depth guides from the trenches. 😴inactive
Getting Started with Headless Chrome - Chrome Developers
Chrome’s Headless mode gets an upgrade: introducing --headless=new - Chrome Developers ❗!important
Headless Chromium
yujiosaka/headless-chrome-crawler: Distributed crawler powered by Headless Chrome
Puppeteer
Puppeteer | Puppeteer
API Reference | Puppeteer
Puppeteer Guides | Puppeteer
Puppeteer - Chrome for Developers
puppeteer/puppeteer: Node.js API for Chrome
Puppeteer Tutorial
Introduction to Puppeteer
Automating Google Chrome with Node.js - Tutorialzine with Puppeteer
The Guide To Ethical Scraping Of Dynamic Websites With Node.js And Puppeteer — Smashing Magazine
Tutorial: User Interface Testing with Jest and Puppeteer
Browser automation revisited - meet Puppeteer | Gergely Nemeth slowMo, profiling, intercepting requests
Web Scraping in JavaScript – How to Use Puppeteer to Scrape Web Pages
Puppeteer – Headless Chrome in a Container – zwischenzugs
How to set up a Headless Chrome Node.js server in Docker - LogRocket Blog
Koan
Puppeteer quick start - Chrome Developers
checkly/puppeteer-examples: Puppeteer example scripts for running Headless Chrome from Node. 😴inactive
Logging into a website | Apify Documentation
// `page.evaluate()` serializes the return value,
// so a non-serializable object comes back as `undefined`.
// workaround: convert it to a string inside the page first
// https://github.com/puppeteer/puppeteer/issues/3986
const windowString = await page.evaluate(() => window.toString());
// get all links
const hrefs = await page.$$eval("a", (as) => as.map((a) => a.href));
console.log(hrefs);
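// get property of element
const video = await page.waitForSelector("video.player");
// `getProperty()` returns a JSHandle; `jsonValue()` unwraps it to a plain value
const srcHandle = await video.getProperty("src");
const src = await srcHandle.jsonValue();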
// slow down browser operations
const puppeteer = require("puppeteer");
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 250, // delay each operation by 250 ms
});
const page = await browser.newPage();
// intercept requests
// install this before `page.goto()`
await page.setRequestInterception(true);
page.on("request", (request) => {
  // `url` is a method, not a property
  if (request.url().includes(".png")) {
    // `abort()` takes an optional error-code string, not an HTTP status
    request.abort();
  } else {
    request.continue();
  }
});
// log network requests
// install this before `page.goto()`
page.on("response", (response) => {
  // allow XHR only
  // if ("xhr" !== response.request().resourceType()) {
  //   return;
  // }
  console.log(`[${response.request().resourceType()}] ${response.url()}`);
});
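A possible way to make the request-interception koan easier to reuse and test is to pull the blocking decision into a plain function. This is only a sketch: `BLOCKED_TYPES`, `shouldBlock` and `installRequestFilter` are hypothetical names, not part of Puppeteer, and the set of blocked resource types is an assumption.

```javascript
// Hypothetical helper names; only the `page` and `request` methods are Puppeteer API.
// Assumption: images, media and fonts are safe to skip for the scrape at hand.
const BLOCKED_TYPES = new Set(["image", "media", "font"]);

// Decide whether a request should be dropped.
function shouldBlock(url, resourceType) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  // Fall back to the file extension, ignoring any query string or fragment.
  const pathname = new URL(url).pathname.toLowerCase();
  return pathname.endsWith(".png");
}

// Wire the predicate into a Puppeteer page. Call this before `page.goto()`.
async function installRequestFilter(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (shouldBlock(request.url(), request.resourceType())) {
      request.abort(); // no argument: Puppeteer uses its default error code
    } else {
      request.continue();
    }
  });
}
```

Usage would be `await installRequestFilter(page);` right after `browser.newPage()`. Keeping `shouldBlock` free of Puppeteer objects means it can be checked without launching a browser.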
Playwright
targets all the popular rendering engines
binding in Node.js, Python, Java, .NET
Fast and reliable end-to-end testing for modern web apps | Playwright
microsoft/playwright: Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
microsoft/playwright-dotnet: .NET version of the Playwright testing and automation library.
nearform/playwright-setup: Barebones playwright testing framework
Mastering Web Scraping in Python: Avoid Blocking Like a Ninja - ZenRows
Crawlee · Build reliable crawlers. Fast. | Crawlee
apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
Python
Image Scraping with Python - Towards Data Science
reanalytics-databoutique/webscraping-open-project: Repository of open knowledge about web scraping in Python
s0md3v/Photon: Incredibly fast crawler designed for OSINT.
binux/pyspider: A Powerful Spider(Web Crawler) System in Python.
Introduction - pyspider
MechanicalSoup/MechanicalSoup: A Python library for automating interaction with websites.
Welcome to MechanicalSoup’s documentation!
Requests-HTML: HTML Parsing for Humans (writing Python 3)! — requests-HTML documentation
Web Scraping 101 with Python
A beginner's guide to web scraping with Python | Opensource.com
Practical Introduction to Web Scraping in Python – Real Python
An Intro to Web Scraping With lxml and Python – Python Tips
Web Scraping with Python: A Comprehensive Tutorial requests+BeautifulSoup
Download Course Materials with A Simple Python Crawler XPath
Mastering Web Scraping in Python: From Zero to Hero - ZenRows
Web scraping and parsing with Beautiful Soup & Python - YouTube
Newspaper3k: Article scraping & curation — newspaper 0.0.2 documentation
codelucas/newspaper: News, full-text, and article metadata extraction in Python 3. Advanced docs:
michaelhelmick/lassie: Web Content Retrieval for Humans™
qinxuye/cola: A high-level distributed crawling framework.
Scrapy
Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
Scrapy documentation
Scrapy Cloud Scraping as a Service, with 7 day data retention
The Scrapinghub Blog – Turn Web Content Into Useful Data
Scraping the Steam Game Store with Scrapy – The Scrapinghub Blog
How to Build your own Price Monitoring Tool – The Scrapinghub Blog
woob
woob - Web Outside of Browsers
JavaScript
How to perform web-scraping using Node.js – Bits and Pieces
How to Perform Web-Scraping using Node.js- Part 2 – Bits and Pieces
ChukwuEmekaAjah/beautiful-dom: A JavaScript library that models essential HTML DOM API methods and properties relevant for extracting data from crawled web pages or XML documents
Beautiful-dom; a HTML parser built with TypeScript - DEV Community 👩💻👨💻
postlight/parser: 📜 Extract meaningful content from the chaos of a web page
Others
Web Scraping with Go | DevDungeon
PuerkitoBio/goquery: A little like that j-thing, only in Go.
segmentio/nightmare: A high-level browser automation library. uses Electron (Chromium inside)
CasperJS, a navigation scripting and testing utility for PhantomJS and SlimerJS Webkit/Gecko
Capybara source Ruby, multiple drivers
Capybara and Selenium for Testing and Scraping - via @codeship