Mirror of https://github.com/webrecorder/browsertrix-crawler.git (synced 2025-12-26 03:40:19 +00:00)
Add option to respect robots.txt disallows (#888)
Fixes #631

- Adds a --robots flag, which enables checking robots.txt for each host before a page is queued for further crawling.
- Supports a --robotsAgent flag, which configures the agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'.
- Robots.txt bodies are parsed and checked for page allow/disallow status using the https://github.com/samclarke/robots-parser library, the most active and well-maintained implementation I could find with TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots.txt entries, each up to 100K.
- Non-200 responses are treated as an empty robots.txt, and an empty robots.txt is treated as 'allow all'.
- Multiple requests for the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
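As a rough illustration of the behavior described above, here is a minimal TypeScript sketch of the robots.txt check. It is not taken from the crawler itself: it uses the robots-parser library named in the commit, but the helper names (fetchRobotsBody, isPageAllowed), the in-memory Map standing in for the Redis-backed LRU, and the fetch-based retrieval are assumptions for illustration. Request batching is omitted.

```ts
// Hypothetical sketch, not the crawler's actual implementation:
// fetch a host's robots.txt and check whether a page may be queued.
import robotsParser from "robots-parser";

// Simple in-memory stand-in for the Redis LRU described above.
const robotsCache = new Map<string, string>();
const MAX_ENTRIES = 100; // retain the last 100 robots.txt bodies
const MAX_BODY_SIZE = 100_000; // cap each cached body at ~100K

async function fetchRobotsBody(robotsUrl: string): Promise<string> {
  const cached = robotsCache.get(robotsUrl);
  if (cached !== undefined) {
    return cached;
  }
  let body = "";
  try {
    const resp = await fetch(robotsUrl);
    // Non-200 responses are treated as an empty robots.txt ("allow all")
    if (resp.status === 200) {
      body = (await resp.text()).slice(0, MAX_BODY_SIZE);
    }
  } catch (e) {
    // Network errors also fall back to "allow all"
  }
  if (robotsCache.size >= MAX_ENTRIES) {
    // Evict the oldest entry (crude approximation of an LRU)
    const oldest = robotsCache.keys().next().value;
    if (oldest !== undefined) {
      robotsCache.delete(oldest);
    }
  }
  robotsCache.set(robotsUrl, body);
  return body;
}

// Returns true if the page may be queued, false if robots.txt disallows it.
async function isPageAllowed(
  pageUrl: string,
  agent = "Browsertrix/1.x",
): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", pageUrl).href;
  const body = await fetchRobotsBody(robotsUrl);
  // An empty robots.txt means "allow all"
  if (!body) {
    return true;
  }
  const robots = robotsParser(robotsUrl, body);
  // robots-parser falls back to the '*' group when no rule matches the agent
  return robots.isAllowed(pageUrl, agent) !== false;
}
```

In this sketch, a crawl that opts in via the new --robots flag (optionally with --robotsAgent) would call something like `await isPageAllowed(pageUrl, robotsAgent)` before adding a page to the queue; disallowed pages are simply never queued.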
@@ -34,6 +34,7 @@
 "pixelmatch": "^5.3.0",
 "pngjs": "^7.0.0",
 "puppeteer-core": "^24.30.0",
+"robots-parser": "^3.0.1",
 "sax": "^1.3.0",
 "sharp": "^0.32.6",
 "tsc": "^2.0.4",