browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-12-24 19:10:15 +00:00

Author	SHA1	Message	Date
Tessa Walsh	0cf6219d80	Fix --overwrite CLI flag (#220 ) * Delete collection if --overwrite before wb-manager init * Add tests	2023-02-02 21:02:47 -08:00
Ilya Kreymer	10e61d4c85	Bump to Chrome 109, Beta 0.8.0-beta.1 Release (#215) - bump to chrome-109 image - bump uwsgi to fix intermittent build errors - remove installs moved to base image bump to 0.8.0-beta.1 0.8.0-beta.1	2023-01-30 19:00:33 -08:00
Ilya Kreymer	38a9dbdaae	behaviors: don't run behaviors in iframes that are about:blank or are… (#211 ) * behaviors: don't run behaviors in iframes that are about:blank or are from an ad-host (even if ad-blocking is not disabled), fixes #210 * logging: log behavior wait start and success, in addition to error, with url in details	2023-01-23 16:47:33 -08:00
Tessa Walsh	c0b0d5b87f	Serialize Redis pending pages as JSON objects (#212 ) * Add redis:// prefix to test --redisStoreUrl * Serialize pending pages as JSON objects	2023-01-23 16:44:03 -08:00
Ilya Kreymer	a767721f5e	crawl state: add getPendingList() to return pending state from either… (#205) * crawl state: add getPendingList() to return pending state from either memory or redis crawl state, fix stats logging with redis state. Return pending list as json object logging: check if data object is an error, log fields from error. Convert missing console.* to new logger * evaluate failure: log with error, not fatal	2023-01-23 10:43:12 -08:00
Tessa Walsh	1a066dbd7b	Add RedisCrawlState test (#208 )	2023-01-23 10:16:22 -08:00
kuechensofa	f9df7a94ce	Add requests[socks] python dependency (#201 ) Add requests[socks] python dependency to enable SOCKS proxy support for pywb inside the docker container	2023-01-19 21:55:07 -08:00
Tessa Walsh	0192d05f4c	Implement improved json-l logging - Add Logger class with methods for info, error, warn, debug, fatal - Add context, timestamp, and details fields to log entries - Log messages as JSON Lines - Replace puppeteer-cluster stats with custom stats implementation - Log behaviors by default - Amend argParser to reflect logging changes - Capture and log stdout/stderr from awaited child_processes - Modify tests to use webrecorder.net to avoid timeouts	2023-01-19 14:17:27 -05:00
Ilya Kreymer	2b03e23174	arg parsing fix: (#200 ) - check if array of scope includes is actually empty before using it over scope - check if screenshot arg setting is empty 0.8.0-beta.0	2023-01-12 19:58:04 -08:00
Ilya Kreymer	5ee05985b1	Use VNC for headful profile creation (#197 ) * profiles: use vnc for automatic profile creation (fixes #194): - add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode - use @novnc/novnc to serve vnc JS library - add novnc_lite.html to serve the content from an iframe - optimization: don't show initial blank page / don't wait for initial page in puppeteer * more vnc work: - set position of browser at 0,0, avoid needing offset to fit - add /vncpass endpoint to query vnc password (for use with browsertrix-cloud) - remove websockify, x11vnc now supports ws connections directly! - vnc_lite: support reconnecting ws if gracefully disconnected * x11vnc cleanup: just pass password via cmdline to simplify setup * make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified README updates: - mention new VNC-based streaming - mention new --automated flag, move automated info below interactive * README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently	2023-01-09 23:56:53 -08:00
Ed Summers	33a153ac54	remove unused parts of config (#198 ) remove commented out config options (enable-auto-fetch and auto-index) to avoid confusion	2023-01-04 17:00:22 -08:00
Tessa Walsh	f35d495103	Add screenshot functionality (#188 ) * Add screenshot and thumbnail functionality Introduces a --screenshot CLI option, which takes a comma-separated list of screenshot types: view,fullPage,thumbnail. In addition, this commit: - Adds '--experimental-global-webcrypto' to ensure webcrypto is available in node - Deprecates newContext, instead always using page context for 1 worker and window context for >1 worker * Separate screenshotTypes into exported const Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>	2022-12-21 09:06:13 -08:00
Ilya Kreymer	057cc82897	new setting: add support for specifying language via the --lang flag (#186 )	2022-11-21 11:59:37 -08:00
Ilya Kreymer	b268c02823	package: fix license string in package.json	2022-11-21 09:20:15 -08:00
Ilya Kreymer	2a1e0edf3c	version: set version correctly to 0.8.0-beta.0	2022-11-15 18:30:27 -08:00
Ilya Kreymer	cacf5da5a1	esm conversion: finish esm conversion for create-login-profile.js	2022-11-15 18:30:27 -08:00
Tessa Walsh	e02058f001	Add ad blocking via request interception (#173 ) * ad blocking via request interception, extending block rules system, adding new AdBlockRules * Load list of hosts to block from https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts added as json on image build * Enabled via --blockAds and setting a custom message via --adBlockMessage * new test to check for ad blocking * Add test-crawls dir to .gitignore and .dockerignore	2022-11-15 18:30:27 -08:00
Ilya Kreymer	277314f2de	Convert to ESM (#179 ) * switch base image to chrome/chromium 105 with node 18.x * convert all source to esm for node 18.x, remove unneeded node-fetch dependency * ci: use node 18.x, update to latest actions * tests: convert to esm, run with --experimental-vm-modules * tests: set higher default timeout (90s) for all tests * tests: rename driver test fixture to .mjs for loading in jest * bump to 0.8.0	2022-11-15 18:30:27 -08:00
Tim	5b738bd24e	Fix incorrect `combineWARCs` property in README.md (#180 ) This stumped me for a little while. The actual property isn't plural. 0.7.1	2022-11-14 22:17:44 -08:00
Ed Summers	cd17764b77	Check if group/user exists (#176 ) Ensure that group and user do not already exist before creating them. Fixes #174	2022-11-03 17:28:13 -07:00
Ilya Kreymer	ffa3174578	Fix for warcio.js (#178 ) * dependency fix: set warcio to 1.5.1 until we update to esm support bump test timeout fixes #175 bump to 0.7.1	2022-10-24 08:20:01 +02:00
Ilya Kreymer	1213694dde	bump to 0.7.0 for release! 0.7.0	2022-10-11 16:14:53 -07:00
Ilya Kreymer	be3b6b85fa	README: update default behaviors in README, fixes #169	2022-10-11 15:33:32 -07:00
Ed Summers	3ba64535a5	Run in Docker as User (#171 ) * Run in Docker as User This follows a similar pattern to pywb to run as the user that owns the crawls directory. bump version to 0.7.0-beta.6 Closes #170	2022-09-28 12:49:52 -07:00
Ilya Kreymer	65933c6b12	Interrupt Handling Fixes (#167) * interrupts: simplify interrupt behavior: - SIGTERM/SIGINT behave same way, trigger a graceful shutdown after page load improvements of remote state / parallel crawlers (for browsertrix-cloud): - SIGUSR1 before SIGINT/SIGTERM ensures data is saved, mark crawler as done - for use with graceful stopping crawl - SIGUSR2 before SIGINT/SIGTERM ensures data is saved, does not mark crawler as done - for use with scaling down a single crawler * scope check: check scope of URL retrieved from queue (in case scoping rules changed), urls matching seed automatically in scope! 0.7.0-beta.5	2022-09-20 17:09:52 -07:00
Ilya Kreymer	fd1737962b	dependencies: update to browsertrix-behaviors 0.3.4, fixes autofetch loading of lazy load images (fixes #165 ) bump to 0.7.0-beta.5	2022-09-15 23:13:31 -07:00
Ilya Kreymer	314ee3f730	Default Wait-Time Improvements (#162 ) - netIdleWait better defaults: if not set, set to 15 seconds for page/page-spa scope, otherwise to 2 seconds - default behaviors: include autoscroll in default behavior as well - restart: if crawl already done, don't attempt to crawl further. if 'waitOnDone' set, wait for signal before exiting. - bump to puppeteer-core 17.1.2 - bump to 0.7.0-beta.4 0.7.0-beta.4	2022-09-08 23:39:26 -07:00
Ilya Kreymer	5c931275ed	pending wait: set max pending request wait to 120 seconds 0.7.0-beta.3	2022-09-02 17:53:04 -07:00
Ilya Kreymer	a52ee5ed1f	dependencies: update to pywb>=2.6.8, browsertrix-behaviors>=0.3.3	2022-09-02 17:45:16 -07:00
Ilya Kreymer	e22d95e2f0	Logging and browser improvements: (#158 ) * logging: add 'jserrors' option to --logging to print JS errors * browser config: use flags from playwright * browser: use socat to allow connecting via devtools via crawling on port 9222	2022-08-21 00:30:25 -07:00
Ilya Kreymer	6cc38bf511	Page-reuse concurrency + Browser Repair + Screencaster Cleanup Improvements (#157 ) * new window: use cdp instead of window.open * new window tweaks: add reuseCount, use browser.target() instead of opening a new blank page * rename NewWindowPage -> ReuseWindowConcurrency, move to windowconcur.js potential fix for #156 * browser repair: - when using window-concurrency, attempt to repair / relaunch browser if cdp errors occur - mark pages as failed and don't reuse if page error or cdp errors occur - screencaster: clear previous targets if screencasting when repairing browser * bump version to 0.7.0-beta.3	2022-08-19 09:23:40 -07:00
Ilya Kreymer	827c153679	fix for latest puppeteer: page._client -> page._client() 0.7.0-beta.2	2022-08-17 21:40:10 -07:00
Ilya Kreymer	c5d208024a	Wait Default + Logging Improvements (#153 ) improved logging of pywb + redis: - if 'logging' includes 'pywb', log pywb and redis output, to pywb.log and redis.log - otherwise, just ignore (don't print to stdout as that's too confusing) - print if wb-manager fails, likely due to existing collection waitUntil: default to just 'load' to avoid potential infinite loop, separate --netIdle can configure idle wait dependency: update to latest puppeteer-core (16.1.0)	2022-08-11 18:44:39 -07:00
raffaele messuti	a527cc9b36	Update README.md (#147 ) fix link to puppeteer waitUntil	2022-08-11 18:28:54 -07:00
Ilya Kreymer	e3b8b5ba21	Add --netIdleWait, bump dependencies (0.7.0-beta.2) (#145 ) - add --netIdleWait option, default to 10 seconds - necessary for some sites that start fetching immediately after page load - add openssl.conf to allow pywb to avoid 'unsafe legacy renegotiation disabled' from openssl - update to browsertrix-behaviors 0.3.2 - update current url for screencasting of page before page load starts bump to 0.7.0-beta.2	2022-07-08 17:17:46 -07:00
Ilya Kreymer	bd10f1ad8c	bump to 0.7.0-beta.1 0.7.0-beta.1	2022-07-03 11:11:11 -07:00
Ilya Kreymer	82c771f7cd	ci: possible fix for ci release build (issues building uwsgi)	2022-07-03 11:09:06 -07:00
Ilya Kreymer	0a309af740	Update to Chrome/Chromium 101 - (0.7.0 Beta 0) (#144 ) * update base image - switch to browsertrix-base-image:101 with chrome/chromium 101, - includes additional fonts and ubuntu 22.04 as base. - add --disable-site-isolation-trials as default flag to support behaviors accessing iframes * debugging support for shared redis state: - support pausing crawler indefinitely if crawl state is set to 'debug' - must be set/unset manually via external redis - designed for browsertrix-cloud for now bump to 0.7.0-beta.0 0.7.0-beta.0	2022-06-30 19:24:26 -07:00
Ilya Kreymer	cf90304fa7	0.6.0 Wait State + Screencasting Fixes (#141) * new options: - to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished - when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished. - set redis state to `failing` in case of exception, set to `failed` in case of >3 or more failed exits within 60 seconds (todo: make customizable) - when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting. - also write WACZ if exiting due to size limit exceeded, but not due to other interruptions - change sleep() to be in seconds * misc fixes: - crawlstate.finished() -> isFinished() - return if >0 pages and none left in queue - don't fail crawl if isFinished() is true - don't keep looping in pending wait for urls to finish if received abort request * screencast improvements (fix related to webrecorder/browsertrix-cloud#233) - more optimized screencasting, don't close and restart after every page. - don't assume targets change after every page, they don't in window mode! - only send 'close' message when target is actually closed * bump to 0.6.0 0.6.0	2022-06-17 11:58:44 -07:00
Ilya Kreymer	e7eb6a6620	create profile: fix typo in cookie settings, multiply by seconds in day uwsgi: set number of workers to be 2x cpus by default	2022-06-01 09:11:11 -07:00
Ilya Kreymer	70ba9241ca	limit interrupt fix: after self-interrupting, only look at local pending list (for redis state) logging: don't log CF check errors, do log when errorCount is reset 0.6.0-beta.1	2022-05-19 06:25:46 +00:00
Ilya Kreymer	6ec47cdd14	profile creation: when creating a profile, force all cookies to have a duration to avoid expiring session cookies (#139 ) - save cookies on page load and also before profile creation - default cookie duration is 7 days, configurable via --cookieDays option	2022-05-18 23:23:32 -07:00
Ilya Kreymer	93b6dad7b9	Health Check + Size Limits + Profile fixes (#138) - Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check - Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded. - Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded. - Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted. - S3 Storage refactor, simplify, don't add additional paths by default. - Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value. - wacz save: reenable wacz validation after save. - Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs. - bump to 0.6.0-beta.1	2022-05-18 22:51:55 -07:00
Ilya Kreymer	500ed1f9a1	Profile Creation Improvements (#136) * interactive profile api improvements: - refactor profile creation into separate class - if profile starts with '@', load as relative path using current s3 storage - support uploading profiles to s3 - profile api: support filename passed to /createProfileJS as part of json POST - profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping) - profile api: add /target to retrieve target and /navigate to navigate by url. * bump to 0.6.0-beta.0	2022-05-05 14:27:17 -05:00
Ilya Kreymer	5dfbfbeaf6	update dependencies: (#134 ) - update pywb to 2.6.7, fix possible error cdx indexing ever via --generateCDX - update wacz to 0.4.6, ensure wacz file is closed and better and more error-resilient text extraction - update browsertrix-behaviors to 0.3.0, support for telegram behavior - bump version to 0.5.1 0.5.1	2022-04-15 16:22:47 -07:00
Ilya Kreymer	9b938304ce	dependencies: update to pywb>=2.6.6, wacz>=0.4.5 0.5.0	2022-04-11 15:09:59 -07:00
Ilya Kreymer	cc391146c4	package: set minio version to fixed (7.0.26)	2022-04-09 22:07:17 -07:00
Ilya Kreymer	bfd72835d1	update CHANGES for 0.5.0 release	2022-04-09 21:59:44 -07:00
Ilya Kreymer	7ed5586bdb	scopeType improvement: when setting scopeType domain on a URL with "www.", automatically drop the www. for simplicity 0.5.0-beta.8	2022-03-22 17:43:13 -07:00
Ilya Kreymer	5afd19f43d	Non-HTML Page Load Optimization (#130 ) * non-html page load improvements: fix for #129 - don't include cookie check in eliminating direct fetch, may be too speculative - as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors - don't do text extraction for non-HTML pages (will need to handle pdf separately) bump to 0.5.0-beta.8	2022-03-22 17:41:51 -07:00

143 Commits