Commit Graph

143 Commits

Author SHA1 Message Date
Tessa Walsh
0cf6219d80 Fix --overwrite CLI flag (#220)
* Delete collection if --overwrite before wb-manager init

* Add tests
2023-02-02 21:02:47 -08:00
Ilya Kreymer
10e61d4c85 Bump to Chrome 109, Beta 0.8.0-beta.1 Release (#215)
- bump to chrome-109 image
- bump uwsgi to fix intermittent build errors
- remove installs moved to base image
bump to 0.8.0-beta.1
0.8.0-beta.1
2023-01-30 19:00:33 -08:00
Ilya Kreymer
38a9dbdaae behaviors: don't run behaviors in iframes that are about:blank or are… (#211)
* behaviors: don't run behaviors in iframes that are about:blank or are from an ad-host (even if ad-blocking is not disabled), fixes #210

* logging: log behavior wait start and success, in addition to error, with url in details
2023-01-23 16:47:33 -08:00
Tessa Walsh
c0b0d5b87f Serialize Redis pending pages as JSON objects (#212)
* Add redis:// prefix to test --redisStoreUrl

* Serialize pending pages as JSON objects
2023-01-23 16:44:03 -08:00
Ilya Kreymer
a767721f5e crawl state: add getPendingList() to return pending state from either… (#205)
* crawl state: add getPendingList() to return pending state from either memory or redis crawl state, fix stats logging with redis state. Return pending list as json object
logging: check if data object is an error, log fields from error. Convert missing console.* to new logger
* evaluate failure: log with error, not fatal
2023-01-23 10:43:12 -08:00
Tessa Walsh
1a066dbd7b Add RedisCrawlState test (#208) 2023-01-23 10:16:22 -08:00
kuechensofa
f9df7a94ce Add requests[socks] python dependency (#201)
Add requests[socks] python dependency to enable SOCKS proxy support for pywb inside the docker container
2023-01-19 21:55:07 -08:00
Tessa Walsh
0192d05f4c Implement improved json-l logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
2023-01-19 14:17:27 -05:00
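The logging change above can be pictured with a minimal sketch, written in illustrative Python rather than the crawler's own Node.js code. It assumes one JSON object per line with the `context`, `timestamp`, and `details` fields named in the commit; the `logLevel` and `message` keys, the default context value, and the method signatures are hypothetical.

```python
import json
import sys
from datetime import datetime, timezone


class Logger:
    """Writes each log entry as one JSON object per line (JSON Lines)."""

    def __init__(self, stream=sys.stdout):
        self.stream = stream

    def _log(self, level, message, details=None, context="general"):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "logLevel": level,    # hypothetical key name
            "context": context,
            "message": message,   # hypothetical key name
            "details": details or {},
        }
        # One serialized object followed by a newline = one JSON Lines record.
        self.stream.write(json.dumps(entry) + "\n")
        return entry

    def info(self, message, details=None, context="general"):
        return self._log("info", message, details, context)

    def warn(self, message, details=None, context="general"):
        return self._log("warn", message, details, context)

    def error(self, message, details=None, context="general"):
        return self._log("error", message, details, context)

    def debug(self, message, details=None, context="general"):
        return self._log("debug", message, details, context)

    def fatal(self, message, details=None, context="general"):
        # This sketch only records the entry; it does not stop the process.
        return self._log("fatal", message, details, context)
```

Because every line is a standalone JSON object, the output can be filtered line by line with ordinary JSON tooling.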
Ilya Kreymer
2b03e23174 arg parsing fix: (#200)
- check if array of scope includes is actually empty before using it over scope
- check if screenshot arg setting is empty
0.8.0-beta.0
2023-01-12 19:58:04 -08:00
Ilya Kreymer
5ee05985b1 Use VNC for headful profile creation (#197)
* profiles: use vnc for automatic profile creation (fixes #194):
- add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode
- use @novnc/novnc to serve vnc JS library
- add novnc_lite.html to serve the content from an iframe
- optimization: don't show initial blank page / don't wait for initial page in puppeteer

* more vnc work:
- set position of browser at 0,0, avoid needing offset to fit
- add /vncpass endpoint to query vnc password (for use with browsertrix-cloud)
- remove websockify, x11vnc now supports ws connections directly!
- vnc_lite: support reconnecting ws if gracefully disconnected

* x11vnc cleanup: just pass password via cmdline to simplify setup

* make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified
README updates:
- mention new VNC-based streaming
- mention new --automated flag, move automated info below interactive

* README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently
2023-01-09 23:56:53 -08:00
Ed Summers
33a153ac54 remove unused parts of config (#198)
remove commented out config options (enable-auto-fetch and auto-index) to avoid confusion
2023-01-04 17:00:22 -08:00
Tessa Walsh
f35d495103 Add screenshot functionality (#188)
* Add screenshot and thumbnail functionality

Introduces a --screenshot CLI option, which takes a comma-separated
list of screenshot types: view,fullPage,thumbnail.

In addition, this commit:

- Adds '--experimental-global-webcrypto' to ensure webcrypto is
available in node
- Deprecates newContext, instead always using page context for 1 worker
and window context for >1 worker

* Separate screenshotTypes into exported const

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
2022-12-21 09:06:13 -08:00
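As a rough illustration of the `--screenshot` option described above, the following Python sketch splits a comma-separated value and checks it against the three types named in the commit. The function name and the error handling are assumptions; the crawler's actual parsing may differ.

```python
# Screenshot types listed in the commit message.
SCREENSHOT_TYPES = ("view", "fullPage", "thumbnail")


def parse_screenshot_arg(value):
    """Split a comma-separated --screenshot value and validate each type."""
    if not value:
        return []
    types = [t.strip() for t in value.split(",") if t.strip()]
    unknown = [t for t in types if t not in SCREENSHOT_TYPES]
    if unknown:
        # Hypothetical behavior: reject unknown types outright.
        raise ValueError("Unknown screenshot type(s): " + ", ".join(unknown))
    return types
```

For example, `parse_screenshot_arg("view,thumbnail")` returns `["view", "thumbnail"]`.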
Ilya Kreymer
057cc82897 new setting: add support for specifying language via the --lang flag (#186) 2022-11-21 11:59:37 -08:00
Ilya Kreymer
b268c02823 package: fix license string in package.json 2022-11-21 09:20:15 -08:00
Ilya Kreymer
2a1e0edf3c version: set version correctly to 0.8.0-beta.0 2022-11-15 18:30:27 -08:00
Ilya Kreymer
cacf5da5a1 esm conversion: finish esm conversion for create-login-profile.js 2022-11-15 18:30:27 -08:00
Tessa Walsh
e02058f001 Add ad blocking via request interception (#173)
* ad blocking via request interception, extending block rules system, adding new AdBlockRules
* Load list of hosts to block from https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts added as json on image build
* Enabled via --blockAds and setting a custom message via --adBlockMessage
* new test to check for ad blocking
* Add test-crawls dir to .gitignore and .dockerignore
2022-11-15 18:30:27 -08:00
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Tim
5b738bd24e Fix incorrect combineWARCs property in README.md (#180)
This stumped me for a little while. The actual property isn't plural.
0.7.1
2022-11-14 22:17:44 -08:00
Ed Summers
cd17764b77 Check if group/user exists (#176)
Ensure that group and user do not already exist before creating them.

Fixes #174
2022-11-03 17:28:13 -07:00
Ilya Kreymer
ffa3174578 Fix for warcio.js (#178)
* dependency fix: set warcio to 1.5.1 until we update to esm support
bump test timeout
fixes #175
bump to 0.7.1
2022-10-24 08:20:01 +02:00
Ilya Kreymer
1213694dde bump to 0.7.0 for release! 0.7.0 2022-10-11 16:14:53 -07:00
Ilya Kreymer
be3b6b85fa README: update default behaviors in README, fixes #169 2022-10-11 15:33:32 -07:00
Ed Summers
3ba64535a5 Run in Docker as User (#171)
* Run in Docker as User

This follows a similar pattern to pywb to run as the user that owns the
crawls directory.

bump version to 0.7.0-beta.6

Closes #170
2022-09-28 12:49:52 -07:00
Ilya Kreymer
65933c6b12 Interrupt Handling Fixes (#167)
* interrupts: simplify interrupt behavior:
- SIGTERM/SIGINT behave the same way, triggering a graceful shutdown after page load

improvements of remote state / parallel crawlers (for browsertrix-cloud):
- SIGUSR1 before SIGINT/SIGTERM ensures data is saved, mark crawler as done - for use with graceful stopping crawl
- SIGUSR2 before SIGINT/SIGTERM ensures data is saved, does not mark crawler as done - for use with scaling down a single crawler

* scope check: check scope of URL retrieved from queue (in case scoping rules changed), urls matching seed automatically in scope!
0.7.0-beta.5
2022-09-20 17:09:52 -07:00
Ilya Kreymer
fd1737962b dependencies: update to browsertrix-behaviors 0.3.4, fixes autofetch loading of lazy load images (fixes #165)
bump to 0.7.0-beta.5
2022-09-15 23:13:31 -07:00
Ilya Kreymer
314ee3f730 Default Wait-Time Improvements (#162)
- netIdleWait better defaults: if not set, set to 15 seconds for page/page-spa scope, otherwise to 2 seconds
- default behaviors: include autoscroll in default behavior as well
- restart: if crawl already done, don't attempt to crawl further. if 'waitOnDone' set, wait for signal before exiting.
- bump to puppeteer-core 17.1.2
- bump to 0.7.0-beta.4
0.7.0-beta.4
2022-09-08 23:39:26 -07:00
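The netIdleWait default described above amounts to a small rule, sketched here in illustrative Python. The function name is hypothetical, and treating `None` as "not set" is an assumption.

```python
def default_net_idle_wait(net_idle_wait, scope_type):
    """Return the network-idle wait in seconds, applying the defaults above."""
    if net_idle_wait is not None:
        # An explicitly configured value always wins.
        return net_idle_wait
    # Single-page scopes get a longer default wait than broader crawls.
    return 15 if scope_type in ("page", "page-spa") else 2
```

Note that an explicit value of `0` is kept as-is in this sketch, since only `None` counts as unset.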
Ilya Kreymer
5c931275ed pending wait: set max pending request wait to 120 seconds 0.7.0-beta.3 2022-09-02 17:53:04 -07:00
Ilya Kreymer
a52ee5ed1f dependencies: update to pywb>=2.6.8, browsertrix-behaviors>=0.3.3 2022-09-02 17:45:16 -07:00
Ilya Kreymer
e22d95e2f0 Logging and browser improvements: (#158)
* logging: add 'jserrors' option to --logging to print JS errors
* browser config: use flags from playwright
* browser: use socat to allow connecting via devtools via crawling on port 9222
2022-08-21 00:30:25 -07:00
Ilya Kreymer
6cc38bf511 Page-reuse concurrency + Browser Repair + Screencaster Cleanup Improvements (#157)
* new window: use cdp instead of window.open

* new window tweaks: add reuseCount, use browser.target() instead of opening a new blank page

* rename NewWindowPage -> ReuseWindowConcurrency, move to windowconcur.js
potential fix for #156

* browser repair:
- when using window-concurrency, attempt to repair / relaunch browser if cdp errors occur
- mark pages as failed and don't reuse if page error or cdp errors occur
- screencaster: clear previous targets if screencasting when repairing browser

* bump version to 0.7.0-beta.3
2022-08-19 09:23:40 -07:00
Ilya Kreymer
827c153679 fix for latest puppeteer: page._client -> page._client() 0.7.0-beta.2 2022-08-17 21:40:10 -07:00
Ilya Kreymer
c5d208024a Wait Default + Logging Improvements (#153)
improved logging of pywb + redis:
- if 'logging' includes 'pywb', log pywb and redis output, to pywb.log and redis.log
- otherwise, just ignore (don't print to stdout as that's too confusing)
- print if wb-manager fails, likely due to existing collection

waitUntil: default to just 'load' to avoid potential infinite loop, separate --netIdle can configure idle wait
dependency: update to latest puppeteer-core (16.1.0)
2022-08-11 18:44:39 -07:00
raffaele messuti
a527cc9b36 Update README.md (#147)
fix link to puppeteer waitUntil
2022-08-11 18:28:54 -07:00
Ilya Kreymer
e3b8b5ba21 Add --netIdleWait, bump dependencies (0.7.0-beta.2) (#145)
- add --netIdleWait option, default to 10 seconds - necessary for some sites that start fetching immediately after page load
- add openssl.conf to allow pywb to avoid 'unsafe legacy renegotiation disabled' from openssl
- update to browsertrix-behaviors 0.3.2
- update current url for screencasting of page before page load starts
bump to 0.7.0-beta.2
2022-07-08 17:17:46 -07:00
Ilya Kreymer
bd10f1ad8c bump to 0.7.0-beta.1 0.7.0-beta.1 2022-07-03 11:11:11 -07:00
Ilya Kreymer
82c771f7cd ci: possible fix for ci release build (issues building uwsgi) 2022-07-03 11:09:06 -07:00
Ilya Kreymer
0a309af740 Update to Chrome/Chromium 101 - (0.7.0 Beta 0) (#144)
* update base image 
- switch to browsertrix-base-image:101 with chrome/chromium 101,
- includes additional fonts and ubuntu 22.04 as base.
- add --disable-site-isolation-trials as default flag to support behaviors accessing iframes

* debugging support for shared redis state:
- support pausing crawler indefinitely if crawl state is set to 'debug'
- must be set/unset manually via external redis
- designed for browsertrix-cloud for now

bump to 0.7.0-beta.0
0.7.0-beta.0
2022-06-30 19:24:26 -07:00
Ilya Kreymer
cf90304fa7 0.6.0 Wait State + Screencasting Fixes (#141)
* new options:
- to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished 
- when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished.
- set redis state to `failing` in case of exception, set to `failed` in case of >3 or more failed exits within 60 seconds (todo: make customizable)
- when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting.
- also write WACZ if exiting due to size limit exceeded, but not due to other interruptions
- change sleep() to be in seconds

* misc fixes:
- crawlstate.finished() -> isFinished() - return if >0 pages and none left in queue
- don't fail crawl if isFinished() is true
- don't keep looping in pending wait for urls to finish if received abort request

* screencast improvements (fix related to webrecorder/browsertrix-cloud#233)
- more optimized screencasting, don't close and restart after every page.
- don't assume targets change after every page, they don't in window mode!
- only send 'close' message when target is actually closed

* bump to 0.6.0
0.6.0
2022-06-17 11:58:44 -07:00
Ilya Kreymer
e7eb6a6620 create profile: fix typo in cookie settings, multiply by seconds in day
uwsgi: set number of workers to be 2x cpus by default
2022-06-01 09:11:11 -07:00
Ilya Kreymer
70ba9241ca limit interrupt fix: after self-interrupting, only look at local pending list (for redis state)
logging: don't log CF check errors, do log when errorCount is reset
0.6.0-beta.1
2022-05-19 06:25:46 +00:00
Ilya Kreymer
6ec47cdd14 profile creation: when creating a profile, force all cookies to have a duration to avoid expiring session cookies (#139)
- save cookies on page load and also before profile creation
- default cookie duration is 7 days, configurable via --cookieDays option
2022-05-18 23:23:32 -07:00
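The cookie-duration change above can be sketched as follows, in illustrative Python. It assumes cookies are plain dictionaries with an optional `expires` field in Unix seconds, and that a missing or negative value marks a session cookie; both are assumptions for this example. The commit does not say whether cookies that already have an expiry are rewritten, so this sketch leaves them untouched.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def force_cookie_duration(cookie, cookie_days=7, now=None):
    """Return a copy of the cookie with an expiry set if it had none."""
    now = time.time() if now is None else now
    updated = dict(cookie)  # do not mutate the caller's cookie
    expires = updated.get("expires")
    if expires is None or expires < 0:
        # Days are converted to seconds before being added to the current time.
        updated["expires"] = int(now + cookie_days * SECONDS_PER_DAY)
    return updated
```

The `cookie_days` parameter mirrors the `--cookieDays` option and its 7-day default.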
Ilya Kreymer
93b6dad7b9 Health Check + Size Limits + Profile fixes (#138)
- Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check

- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded.

- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded.

- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.

- S3 Storage refactor, simplify, don't add additional paths by default.

- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.

- wacz save: reenable wacz validation after save.

- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.

- bump to 0.6.0-beta.1
2022-05-18 22:51:55 -07:00
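The health check rule stated above (200 while consecutive failed page loads stay below twice the worker count, 503 otherwise) reduces to a single comparison. A minimal Python sketch, with a hypothetical function name and without the HTTP server around it:

```python
def health_status(consecutive_failed_loads, num_workers):
    """Return the HTTP status code a health check endpoint would send."""
    threshold = 2 * num_workers
    return 200 if consecutive_failed_loads < threshold else 503
```

With 4 workers, 7 consecutive failures still report 200, while the 8th flips the check to 503.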
Ilya Kreymer
500ed1f9a1 Profile Creation Improvements (#136)
* interactive profile api improvements:
- refactor profile creation into separate class
- if profile starts with '@', load as relative path using current s3 storage
- support uploading profiles to s3
- profile api: support filename passed to /createProfileJS as part of json POST
- profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping)
- profile api: add /target to retrieve target and /navigate to navigate by url.

* bump to 0.6.0-beta.0
2022-05-05 14:27:17 -05:00
Ilya Kreymer
5dfbfbeaf6 update dependencies: (#134)
- update pywb to 2.6.7, fix possible cdx indexing error via --generateCDX
- update wacz to 0.4.6, ensure wacz file is closed, with better and more error-resilient text extraction
- update browsertrix-behaviors to 0.3.0, support for telegram behavior
- bump version to 0.5.1
0.5.1
2022-04-15 16:22:47 -07:00
Ilya Kreymer
9b938304ce dependencies: update to pywb>=2.6.6, wacz>=0.4.5 0.5.0 2022-04-11 15:09:59 -07:00
Ilya Kreymer
cc391146c4 package: set minio version to fixed (7.0.26) 2022-04-09 22:07:17 -07:00
Ilya Kreymer
bfd72835d1 update CHANGES for 0.5.0 release 2022-04-09 21:59:44 -07:00
Ilya Kreymer
7ed5586bdb scopeType improvement: when setting scopeType domain on a URL with "www.", automatically drop the www. for simplicity 0.5.0-beta.8 2022-03-22 17:43:13 -07:00
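The `www.` handling in the commit above can be illustrated with a short Python sketch. It only derives the host used for a domain scope; the function name is hypothetical, and how the crawler builds its actual scope rule from that host is not shown here.

```python
from urllib.parse import urlsplit


def domain_scope_host(url):
    """Return the URL's host with a leading "www." removed, if present."""
    host = urlsplit(url).hostname or ""
    # Only an exact "www." label prefix is dropped; "wwwexample.com" is kept.
    return host[4:] if host.startswith("www.") else host
```

For example, `domain_scope_host("https://www.example.com/path")` returns `"example.com"`.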
Ilya Kreymer
5afd19f43d Non-HTML Page Load Optimization (#130)
* non-html page load improvements: fix for #129
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
- don't do text extraction for non-HTML pages (will need to handle pdf separately)
bump to 0.5.0-beta.8
2022-03-22 17:41:51 -07:00