tags : System Design,Archival
FAQ
Good resources?
- https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero
- https://github.com/lorien/awesome-web-scraping/tree/master : Awesome list of tools
Legal?
- Are website terms of use enforced?
- Web Scraping for Me, But Not for Thee (Guest Blog Post) - Technology & Marketing Law Blog
- incolumitas.com – So you want to Scrape like the Big Boys? 🚀
Headless vs. non-headless
- headless: no GUI (e.g. web scraping)
- non-headless: GUI with visual rendering (e.g. when the user needs to keep seeing what the automation does)
What are the different kinds of scraping bots?
I’ll keep updating this list.
- Sneaker bot: commonly called a “shoe bot”; software designed to help individuals quickly purchase limited-availability stock.
Tools
Web Scraping projects
Tool Name | Description | Use Case | Links |
---|---|---|---|
BrightData | Developer-focused proxy network and scraping infrastructure | Custom scraping solutions | Website |
Diffbot | AI-powered structured data extraction API | Market research | Website |
ScrapingBee | Headless browser management service | Browser automation | Website |
Apify | Cloud-based platform for web scraping and automation | Large-scale data extraction, automation workflows | Website |
Octoparse | No-code web scraping tool with a user-friendly interface | Non-technical data collection | Website |
Zyte | Formerly Scrapinghub; provides Scrapy framework and managed scraping services | Structured data extraction | Website |
SerpAPI | API for accessing Google search results programmatically | Search engine data collection | Website |
Web Discovery & Mining & Text Processing
Tool Name | Description | Use Case | Links |
---|---|---|---|
Trafilatura | Advanced web scraping library with metadata extraction | Content harvesting | GitHub |
Minet | Python webmining toolkit with CLI interface | Large-scale scraping | GitHub |
postlight/parser | Mercury parser for web content extraction | Article extraction | GitHub |
crawl4ai | Open-Source LLM-Friendly Web Crawler & Scraper | ||
Firecrawl | Open-source tool for extracting clean, LLM-ready data from websites | Web scraping for AI apps | Website |
LLM Scraper | TypeScript library for structured web scraping using LLMs | Web data extraction | GitHub |
OmniParser | Computer vision tool for parsing UI screenshots into structured data | GUI automation agents | GitHub |
simonw/shot-scraper | Takes pixel-perfect screenshots; can be used for change detection | ||
files-to-prompt | Concatenates multiple files into a single prompt for LLM usage | Prepping text for LLM prompts | GitHub |
MarkItDown | Converts files and office documents (PDF, Word, Excel, PowerPoint, etc.) to Markdown | Prepping documents for LLM prompts | GitHub |
defuddle-cli | CLI for Defuddle, which extracts the main content from a web page and strips the clutter | Article extraction | GitHub |
repomix | Packs an entire code repository into a single file for LLM usage | Prepping code for LLM prompts | GitHub |
Browser automation
Tool Name | Description | Use Case | Links |
---|---|---|---|
vimGPT/browserGPT | AI-powered automation tools for editors/browsers | Workflow automation | (Community projects) |
Stagehand | AI-assisted browser automation framework | Web testing | GitHub |
Change Detection
Tool Name | Description | Use Case | Links |
---|---|---|---|
urlwatch | Website change monitoring with multiple notification channels | Content tracking | GitHub |
changedetection.io | Self-hosted visual change detection platform | Website monitoring | GitHub |
Changd | Open-source web monitoring tool for visual changes, XPath, and API data | Website change monitoring | GitHub |
Visualping | Commercial service for monitoring webpage changes with alerts and reports | Business intelligence, compliance | Website |
Post-Processing
Tool Name | Description | Use Case | Links |
---|---|---|---|
strip-tags | HTML tag stripping utility | Text cleanup | GitHub |
mailparser | Advanced email parsing library | Email processing | GitHub |
Social Media Tools
Tool Name | Description | Use Case | Links |
---|---|---|---|
twarc2 | Command-line tool and Python library for archiving Twitter data (from Documenting the Now, not an official Twitter tool) | Social media research | Docs |
snscrape | Social media scraping toolkit (multiple platforms) | Public data collection | GitHub |
PMAW | Pushshift wrapper for Reddit data | Reddit analysis | GitHub |
Miscellaneous Tools
Tool Name | Description | Use Case | Links |
---|---|---|---|
browser_cookie3 | Browser cookie extraction library | Authentication automation | GitHub |
pdf2htmlEX | PDF to HTML converter | Document processing | GitHub |
- Surfer: Centralize all your personal data from online platforms | Hacker News
- https://github.com/bjesus/pipet
Enumeration & Brute-Force
Tool Name | Description | Use Case | Links |
---|---|---|---|
Legba | Advanced network protocol brute-forcing tool | Security testing | Blog |
Checklist & Best Practices
Checklist
- Use something like Wappalyzer to find out the tech stack, the anti-bot protection in use, etc.
- Does the website have an API (internal or exposed)?
- Does it have some JSON inside the HTML? E.g. the site might preload JSON payloads into the initial HTML for hydration.
- Think beyond DOM scraping
- If it’s DOM-based scraping and we’re using Playwright, can we get by with codegen?
- Is the data being served via an iframe? In that case, check the source of the frame.
- Does it make certain requests only from the mobile app? TODO: How do we catch these?
- Is the data being rendered via canvas, so there is no DOM at all? Maybe tools like shot-scraper, ishan0102/vimGPT, OpenAdapt, or mayt/BrowserGPT can help?
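The "JSON inside the HTML" check can be sketched roughly like this. It is a minimal sketch, assuming the payload sits verbatim in a script tag with type application/json (Next.js does this under the id __NEXT_DATA__); the helper name extractEmbeddedJson and the sample HTML are made up for illustration. A regex keeps the sketch dependency-free, but a real HTML parser would be more robust.

```typescript
// Hypothetical helper: collect every JSON payload embedded in
// <script type="application/json"> (or application/ld+json) tags.
// Assumption: the JSON is embedded verbatim in the tag body.
function extractEmbeddedJson(html: string): Record<string, unknown> {
  const found: Record<string, unknown> = {};
  // Capture the opening tag's attributes and the tag body (non-greedy).
  const scriptRe = /<script\b([^>]*)>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  let unnamed = 0;
  while ((match = scriptRe.exec(html)) !== null) {
    const attrs = match[1];
    const body = match[2];
    if (!/type\s*=\s*["']application\/(?:ld\+)?json["']/i.test(attrs)) continue;
    // Key the payload by the tag's id, or by a counter when it has none.
    const idMatch = /\bid\s*=\s*["']([^"']+)["']/i.exec(attrs);
    const key = idMatch ? idMatch[1] : `script-${unnamed++}`;
    try {
      found[key] = JSON.parse(body);
    } catch {
      // Not valid JSON (or escaped in a site-specific way): skip it.
    }
  }
  return found;
}

// Made-up page that preloads its data for hydration.
const sampleHtml =
  '<html><body><div id="app"></div>' +
  '<script id="__NEXT_DATA__" type="application/json">{"props":{"items":[1,2,3]}}</script>' +
  '<script>console.log("not json")</script></body></html>';

console.log(extractEmbeddedJson(sampleHtml)["__NEXT_DATA__"]);
```

If this returns the data you need, a plain HTTP fetch of the page is enough and no browser automation is required.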
Best practices
Sites with dynamic sessions
- These usually need a complex combination of temporary auth tokens and headers, which expire and are difficult to reproduce outside the context of the app.
- In these cases we more or less need to automate the task of “inspecting the network tab” from inside the application context. (See Page.setRequestInterception() in Puppeteer and Network Events | Playwright.)
- Sometimes the tokens are even predictable in some way.
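Automating the "inspect the network tab" step could look roughly like the sketch below. It assumes Playwright, where page.on("request", ...), request.url() and request.headers() (lower-cased header names) exist; RequestLike and PageLike are local stand-ins so the sketch type-checks without Playwright installed, and the header names in TOKEN_HEADERS and the function name captureTokenHeaders are assumptions for illustration.

```typescript
// Local stand-ins for the parts of Playwright's Request and Page used here.
interface RequestLike {
  url(): string;
  headers(): Record<string, string>;
}
interface PageLike {
  on(event: "request", listener: (request: RequestLike) => void): unknown;
}

// Assumed header names; inspect the real app's traffic to find the ones that matter.
const TOKEN_HEADERS = ["authorization", "x-csrf-token"];

// Records the most recent token headers sent to URLs containing apiPrefix.
function captureTokenHeaders(page: PageLike, apiPrefix: string): Record<string, string> {
  const latest: Record<string, string> = {};
  page.on("request", (request) => {
    if (request.url().indexOf(apiPrefix) === -1) return;
    const headers = request.headers();
    for (const name of TOKEN_HEADERS) {
      const value = headers[name];
      if (value !== undefined) latest[name] = value;
    }
  });
  return latest; // mutated in place as the page makes requests
}
```

With Playwright, the idea would be to call captureTokenHeaders(page, "/api/") before page.goto(...), let the app make its own requests, and then replay the captured headers from a plain HTTP client until they expire.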
Sites with data in the runtime Heap
- E.g. find the Apollo client instance in memory and use it to get the data. Profit? (See adriancooney/puppeteer-heap-snapshot; this also works with Playwright on Chromium because it uses the CDP.)
- This can be slow, but it is nice because even if the UI changes frequently, the underlying data structure that stores the data might not.
DOM based scraping
- We try using Playwright codegen if possible
- Don’t use XPath or CSS selectors at all (unless you have no choice). Rely on more generic locators instead, e.g. “the button that has ‘Sign in’ on it”:
await page.getByRole('button', { name: 'Sign in' }).click();
Other ideas
Crawlee Primer
- Currently supports 3 main crawlers: CheerioCrawler, PuppeteerCrawler and PlaywrightCrawler
- Crawlee also offers Request and RequestQueue; these are low-level building blocks
- Every crawler has an implicit RequestQueue instance, and you can add requests to it with the crawler.addRequests() method.
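To make the queue idea concrete, here is a toy model of a request queue that de-duplicates by a unique key. It is not Crawlee's implementation: TinyRequestQueue and normalizeUrl are hypothetical names, the method names only mirror Crawlee's addRequest/fetchNextRequest, and the normalization rule (lower-case scheme and host, drop the fragment) is a simplifying assumption.

```typescript
// Toy model of a request queue: NOT Crawlee's implementation, just the idea.
interface QueuedRequest {
  url: string;
  uniqueKey: string;
}

// Simplified normalization (assumption): the URL parser lower-cases the
// scheme and host, and we drop the fragment.
function normalizeUrl(url: string): string {
  const parsed = new URL(url);
  parsed.hash = "";
  return parsed.toString();
}

class TinyRequestQueue {
  private pending: QueuedRequest[] = [];
  private seen: Record<string, true> = {};

  // Returns true when the request was new and got enqueued.
  addRequest(url: string): boolean {
    const uniqueKey = normalizeUrl(url);
    if (this.seen[uniqueKey]) return false;
    this.seen[uniqueKey] = true;
    this.pending.push({ url, uniqueKey });
    return true;
  }

  // Returns the oldest pending request, or undefined when the queue is empty.
  fetchNextRequest(): QueuedRequest | undefined {
    return this.pending.shift();
  }
}

const queue = new TinyRequestQueue();
queue.addRequest("https://example.com/a");
queue.addRequest("HTTPS://Example.com/a#top"); // same page, so it is skipped
queue.addRequest("https://example.com/b");
```

In Crawlee itself you would normally not build this by hand; crawler.addRequests() feeds the crawler's implicit queue.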
Playwright notes
Injecting scripts
https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/injecting-code
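import path from "node:path";

// Inject a plain JavaScript file; it runs in the page before any of the page's own scripts
await page.addInitScript({
  path: path.join(injectionsDir, "dismissDialog.js"),
});
// or
// Expose a Node-side function on window under its own name
await page.exposeFunction(isShown.name, isShown);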
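I think the benefit of exposeFunction is that we get type safety for the function; with addInitScript the injected file has to be plain JavaScript (not TypeScript). Note that a function exposed with exposeFunction runs in the Node.js process, not in the page, and calling it from the page returns a Promise.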
Bot detection
Waiting for items to appear
networkidle is discouraged; wait for the specific element or response you need instead. See https://github.com/microsoft/playwright/issues/22897
Resources
War stories
- So… I built a Browser Extension to grab the data at a speed that is usually under their detection rate. Basically created a distributed scraper and passed it out to as many people in the league as I could.
- I found that tampermonkey is often much easier to deal with in most cases and also much quicker to develop for
- some sites can block ‘self’ origin scripts by leaving it out of the CSP and only allowing scripts they control served by a CDN
Others
- Cutting-edge web scraping techniques | Lobsters
- The most important HTTP headers for scraping | Colly
- Tracking supermarket prices with Playwright | Hacker News
- Web Scraping: Bypassing “403 Forbidden,” captchas, and more | Hacker News
Antibot stuff
Antibot Protection
If the anti-bot system detects your fingerprint or you raise suspicion, you get a captcha. The idea is to detect which anti-bot mechanism is at play and then use the matching bypass techniques when scraping. With some anti-bot tools you may not even need a headless browser; rotating proxies alone may solve it.
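Detecting which anti-bot mechanism is at play can start from response headers and cookies. The sketch below is a heuristic only: the signature table reflects commonly observed markers (cf-ray and __cf_bm for Cloudflare, the datadome cookie, _px-prefixed cookies for PerimeterX, x-kpsdk-ct for Kasada) and should be treated as an assumption to verify against real traffic; guessAntiBotVendors is a made-up name.

```typescript
type Vendor = "Cloudflare" | "DataDome" | "PerimeterX" | "Kasada";

interface Signature {
  vendor: Vendor;
  headerNames: string[]; // lower-cased response header names
  cookiePrefixes: string[]; // lower-cased cookie-name prefixes from Set-Cookie
}

// Heuristic markers (assumption): vendors can change these at any time.
const SIGNATURES: Signature[] = [
  { vendor: "Cloudflare", headerNames: ["cf-ray"], cookiePrefixes: ["__cf_bm", "cf_clearance"] },
  { vendor: "DataDome", headerNames: ["x-datadome"], cookiePrefixes: ["datadome"] },
  { vendor: "PerimeterX", headerNames: [], cookiePrefixes: ["_px"] },
  { vendor: "Kasada", headerNames: ["x-kpsdk-ct"], cookiePrefixes: [] },
];

// Extracts the cookie name from a Set-Cookie value such as "name=value; Path=/".
function cookieName(setCookie: string): string {
  const eq = setCookie.indexOf("=");
  return (eq === -1 ? setCookie : setCookie.slice(0, eq)).trim().toLowerCase();
}

// Returns every vendor whose markers appear in the response.
function guessAntiBotVendors(headers: Record<string, string>, setCookies: string[]): Vendor[] {
  const headerNames = Object.keys(headers).map((h) => h.toLowerCase());
  const cookieNames = setCookies.map(cookieName);
  return SIGNATURES.filter(
    (sig) =>
      sig.headerNames.some((h) => headerNames.indexOf(h) !== -1) ||
      sig.cookiePrefixes.some((p) => cookieNames.some((n) => n.indexOf(p) === 0))
  ).map((sig) => sig.vendor);
}

// Made-up response metadata for illustration.
const vendors = guessAntiBotVendors(
  { "CF-RAY": "abc123-FRA", "content-type": "text/html" },
  ["datadome=xyz; Path=/; Secure", "session=1"]
);
console.log(vendors);
```

An empty result does not mean the site is unprotected; it only means none of these particular markers showed up in that response.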
Fingerprinting
See Anonymity
Passive
This is usually not under your control. You can try changing devices etc.
- TCP/IP: IPv4 and IPv6 headers, TCP headers, the dynamics of the TCP handshake, and the contents of application-level payloads. (See p0f)
- TLS: The ClientHello of the TLS handshake is not encrypted and can be used for fingerprinting.
- HTTP: Clients differ in certain protocol frames, which can be used to fingerprint them, e.g. SETTINGS/WINDOW_UPDATE/PRIORITY in HTTP/2.
Active
In this case, the website runs tests against your client to check whether your fingerprint matches what you claim to be, and acts on the result.
- Canvas fingerprinting: the site renders something that comes out differently on, say, a personal computer vs. a VM. WebGL fingerprinting works similarly.
Products offering protection
- Datadome
- PerimeterX
- Kasada
- Cloudflare
- You could also get creative, e.g. try to find the origin IP (DNS leaks, logs, subdomains, etc.). But this only works if the site admin forgot to add firewall rules that allow traffic only from Cloudflare.
- OSS
Antibot solutions
Proxy services
I’ll just say that firefox still runs tampermonkey, and that includes firefox mobile, so depending on how often you need a different IP and how much data you’re getting, you might be able to do away with the whole idea of proxies and just have a few mobile phones that can be configured as workers that take requests through a tampermonkey script. Or that a laptop tethers to that does the same, or that runs puppeteer itself. It depends on whether a worker needs a new IP every few minutes, hours or days as to whether a real mobile phone works (as some manual interaction is often required to actively change the IP). - kbenson
- Residential/Mobile
- 4G rotating proxies??
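Whichever proxy source is used, the rotation logic itself is small. Below is a minimal sketch of a round-robin pool that benches a proxy for a cooldown period after it gets blocked; ProxyPool, its method names, the 60-second default and the proxy URLs are all hypothetical, and the now parameter exists only to make the behaviour easy to test.

```typescript
// Round-robin proxy pool that benches a proxy for a while after a failure.
class ProxyPool {
  private readonly proxies: string[];
  private readonly cooldownMs: number;
  private cursor = 0;
  private benchedUntil: Record<string, number> = {};

  constructor(proxies: string[], cooldownMs = 60000) {
    this.proxies = proxies;
    this.cooldownMs = cooldownMs;
  }

  // Returns the next usable proxy, or undefined when all are cooling down.
  next(now: number = Date.now()): string | undefined {
    for (let i = 0; i < this.proxies.length; i++) {
      const proxy = this.proxies[(this.cursor + i) % this.proxies.length];
      if ((this.benchedUntil[proxy] ?? 0) <= now) {
        this.cursor = (this.cursor + i + 1) % this.proxies.length;
        return proxy;
      }
    }
    return undefined;
  }

  // Call when a proxy gets blocked (captcha, 403, 429, ...).
  reportFailure(proxy: string, now: number = Date.now()): void {
    this.benchedUntil[proxy] = now + this.cooldownMs;
  }
}

// Placeholder proxy URLs for illustration.
const pool = new ProxyPool(["http://proxy-a:8080", "http://proxy-b:8080"]);
const chosen = pool.next();
if (chosen !== undefined) pool.reportFailure(chosen); // e.g. after a 403 through this proxy
```

How long the cooldown should be depends on the target site and on how quickly the provider (or a tethered phone) hands out a fresh IP.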
Captcha solvers
Obfuscate fingerprint
- May require playing with JS
- Manage cookies/headers
- Crack backend APIs and so on.
Other configs
- There is always some site-specific config that you’ll need to find by trial and error, e.g. some sites might not like headless, so you have to scrape non-headless or something similar.
Pre-made solutions
- These usually do the job of proxy services plus fingerprint obfuscation
- Bright Data, Zyte API, Smartproxy and Oxylabs Web Unlocker