Add option to respect robots.txt disallows (#888)

Fixes #631 
- Adds a --robots flag which enables checking robots.txt for each page's host before the page is queued for further crawling.
- Supports a --robotsAgent flag which configures the user agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'.
- Robots.txt bodies are parsed and checked for page allow/disallow status using the https://github.com/samclarke/robots-parser library, which is the most active and well-maintained implementation I could find with TypeScript types (see the first sketch after this list).
- Fetched robots.txt bodies are cached by URL in Redis using an LRU, retaining the last 100 entries at up to 100K each (second sketch below).
- Non-200 responses are treated as an empty robots.txt, and an empty robots.txt is treated as 'allow all'.
- Multiple requests for the same robots.txt are batched so that only one fetch is performed, waiting up to 10 seconds per fetch (third sketch below).
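Below is a minimal sketch of the allow/disallow check with robots-parser; the rule strings and URLs are illustrative, not taken from the crawler:

```ts
import robotsParser from "robots-parser";

const robotsUrl = "https://example.com/robots.txt";
const body = ["User-agent: *", "Disallow: /private/"].join("\n");

const robots = robotsParser(robotsUrl, body);

// Checked against the configured agent; robots-parser falls back to the
// '*' group when no agent-specific group matches.
robots.isAllowed("https://example.com/page.html", "Browsertrix/1.x");      // true
robots.isAllowed("https://example.com/private/a.html", "Browsertrix/1.x"); // false

// An empty body (e.g. substituted for a non-200 response) has no rules,
// so every URL is allowed:
const empty = robotsParser(robotsUrl, "");
empty.isAllowed("https://example.com/private/a.html"); // true
```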
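The Redis-backed LRU can be sketched roughly as follows, assuming ioredis; the key names (`robots:body`, `robots:lru`) and helper functions are hypothetical, not the crawler's actual schema:

```ts
import Redis from "ioredis";

const redis = new Redis();
const MAX_ENTRIES = 100;       // retain the last 100 robots.txt bodies
const MAX_BODY_SIZE = 100_000; // cap each cached body at ~100K

async function cacheRobotsBody(url: string, body: string): Promise<void> {
  // Oversized bodies are truncated rather than cached whole (assumption).
  await redis.hset("robots:body", url, body.slice(0, MAX_BODY_SIZE));
  // Track recency in a sorted set scored by last-access time.
  await redis.zadd("robots:lru", Date.now(), url);
  // Evict least-recently-used entries beyond the cap.
  const count = await redis.zcard("robots:lru");
  if (count > MAX_ENTRIES) {
    const evict = await redis.zrange("robots:lru", 0, count - MAX_ENTRIES - 1);
    if (evict.length) {
      await redis.hdel("robots:body", ...evict);
      await redis.zrem("robots:lru", ...evict);
    }
  }
}

async function getCachedRobotsBody(url: string): Promise<string | null> {
  const body = await redis.hget("robots:body", url);
  if (body !== null) {
    await redis.zadd("robots:lru", Date.now(), url); // refresh recency
  }
  return body;
}
```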
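And a rough sketch of the single-fetch batching with the 10-second timeout, assuming a Node 18+ global fetch; the names here are illustrative:

```ts
// Pending fetches keyed by robots.txt URL, so N concurrent callers for
// the same URL trigger only one network request.
const inflight = new Map<string, Promise<string>>();

async function fetchRobotsBody(robotsUrl: string): Promise<string> {
  let pending = inflight.get(robotsUrl);
  if (!pending) {
    pending = (async () => {
      try {
        const resp = await fetch(robotsUrl, {
          signal: AbortSignal.timeout(10_000), // abort after 10 seconds
        });
        // Non-200 responses are treated as an empty robots.txt ("allow all").
        return resp.ok ? await resp.text() : "";
      } finally {
        inflight.delete(robotsUrl);
      }
    })();
    inflight.set(robotsUrl, pending);
  }
  return pending;
}
```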

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Authored by Tessa Walsh, 2025-11-26 22:00:06 -05:00
Committed by GitHub
parent 75a0c9a305
commit 1d15a155f2
9 changed files with 247 additions and 5 deletions

package.json

@@ -34,6 +34,7 @@
     "pixelmatch": "^5.3.0",
     "pngjs": "^7.0.0",
     "puppeteer-core": "^24.30.0",
+    "robots-parser": "^3.0.1",
     "sax": "^1.3.0",
     "sharp": "^0.32.6",
     "tsc": "^2.0.4",