Commit Graph

  • b80f892e0f error handling: - skip invalid wacz files provided for import - skip invalid multi-wacz json files provided for import - tests: add invalid multi-wacz file for testing hash-based-dedup Ilya Kreymer 2025-12-20 12:14:33 -08:00
  • 2c8d22f76a track size of page resources: - add 'size' entry to each resource in urn:pageinfo records - add 'size' entry to pages in pages.jsonl, set to sum of the size of all resources listed in urn:pageinfo record (see the page-size sketch after this list) add-page-size-info Ilya Kreymer 2025-12-20 11:10:08 -08:00
  • 1860b97d05 tests: add test for import from json Ilya Kreymer 2025-12-20 10:04:43 -08:00
  • c73e9ce30c remove extra sleep Ilya Kreymer 2025-12-19 22:18:01 -08:00
  • 4792aefc51 always commit Ilya Kreymer 2025-12-19 22:01:25 -08:00
  • ea866db738 add logging Ilya Kreymer 2025-12-19 21:40:15 -08:00
  • b090df7f74 fix getHashDupe, use all key Ilya Kreymer 2025-12-19 21:28:52 -08:00
  • b30a35604c indexer: ensure indexer size is number Ilya Kreymer 2025-12-19 21:13:26 -08:00
  • d3a7290d8c include size in hash key data; add hash dupe when WARC record actually written; store savedSize as diff between original and revisit WARC records; indexer: compute savedSize by tracking and subtracting revisit records to be added, if revisit added before original (see the savedSize sketch after this list) Ilya Kreymer 2025-12-19 16:12:30 -08:00
  • 17076f1c37 Deployed 0ecaa38 with MkDocs version: 1.6.1 gh-pages 2025-12-17 00:29:03 +00:00
  • 0ecaa38e68 Fix custom behavior class example in docs (#940) main Tessa Walsh 2025-12-16 19:26:51 -05:00
  • e320908e6a don't fail crawl if profile cannot be saved (#939) v1.10.2 Ilya Kreymer 2025-12-15 12:18:55 -08:00
  • 40983f1670 add urlNormalize to addHashDupe Ilya Kreymer 2025-12-11 10:46:23 -08:00
  • f00d791e1b fix size count typo, unique == not dupe! Ilya Kreymer 2025-12-11 10:37:53 -08:00
  • 1eba37aea7 don't commit to all if will be purged anyway Ilya Kreymer 2025-12-10 23:50:56 -08:00
  • 60c9b7d5d0 update purging of crawls to re-add/recommit from added crawls, instead of removing hashes from removed crawls, as hashes may be present in other crawls; remove crawl-specific keys for removed crawls Ilya Kreymer 2025-12-10 19:01:37 -08:00
  • 1a8fa632dd uniq -> unique; add 'removable' count for the number of crawls that can be removed from the index Ilya Kreymer 2025-12-10 15:18:59 -08:00
  • 36d0020354 stats: - compute totalUrls, totalSize, uniqSize (uniqUrls = number of hashes) in per-crawl key - add stats on crawl commit, remove on crawl remove - tests: update tests to check stats Ilya Kreymer 2025-12-10 12:40:44 -08:00
  • f68175f74a don't include current crawl as self-reference dependency Ilya Kreymer 2025-12-09 16:20:19 -08:00
  • aa44e5491c cleanup pass: - support dedupe without requiring wacz, no crawl dependency tracking stored - add dedupe test w/o wacz - cleanup dedupe related naming Ilya Kreymer 2025-11-28 01:16:58 -08:00
  • c401c4871e generate wacz filename if deduping Ilya Kreymer 2025-11-27 23:40:02 -08:00
  • 460badf8c7 add removing option to also remove unused crawls if doing a full sync, disabled by default Ilya Kreymer 2025-10-25 15:41:31 -07:00
  • 7c9317e3dc indexer optimize: commit only if added Ilya Kreymer 2025-10-25 13:17:01 -07:00
  • 7a5b3b2c18 rename 'dedup' -> 'dedupe' for consistency Ilya Kreymer 2025-10-25 09:33:37 -07:00
  • cb9367460f always return wacz; store wacz depends only for current wacz; store crawlid depends for entire crawl Ilya Kreymer 2025-10-24 15:01:00 -07:00
  • b5157ae3b5 cleanup, keep compatibility with redis 6; still set to 'post-crawl' state after uploading Ilya Kreymer 2025-10-24 13:24:53 -07:00
  • e0244391f1 update to new data model: - hashes stored in separate crawl specific entries, h:<crawlid> - wacz files stored in crawl specific list, c:<crawlid>:wacz - hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set - store filename, crawlId in related.requires list entries for each wacz (see the key-layout sketch after this list) Ilya Kreymer 2025-10-24 10:38:36 -07:00
  • d620e21991 - track source index for each hash, so entry becomes '<source index> <date> <url>' - entry for source index can contain the crawl id (or possibly wacz and crawl id) - also store dependent sources in relation.requires in datapackage.json - tests: update tests to check for relation.requires Ilya Kreymer 2025-10-17 18:08:38 -07:00
  • dc04923c49 dedup POST requests and non-404s as well! update timestamp after import Ilya Kreymer 2025-09-25 10:40:57 -07:00
  • 76737b72fd use dedup redis to queue up wacz files that need to be updated; use pending queue to support retries in case of failure; store both id and actual URL in case URL changes in subsequent retries Ilya Kreymer 2025-09-22 22:30:08 -07:00
  • 81d7848a79 dedup indexing: strip hash prefix from digest, as cdx does not have it (see the digest sketch after this list); tests: add index import + dedup crawl to ensure digests match fully Ilya Kreymer 2025-09-22 17:46:19 -07:00
  • 7e553b6a87 deps update Ilya Kreymer 2025-09-19 20:54:52 -07:00
  • 94ac058488 tests: add dedup-basic.test for simple dedup, ensure number of revisit records === number of response records Ilya Kreymer 2025-09-18 13:10:53 -07:00
  • 5c02c0a18c bump to 2.4.7 Ilya Kreymer 2025-09-18 12:17:33 -07:00
  • 77dff861b7 update to latest warcio (2.4.7) to fix issues when returning payload-only size Ilya Kreymer 2025-09-18 02:04:28 -07:00
  • 3995629e0d rename --dedupStoreUrl -> redisDedupUrl; bump version to 1.9.0; fix typo Ilya Kreymer 2025-09-17 23:36:25 -07:00
  • 60ff421782 warc writing: - update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits - copy additional custom WARC headers to revisit from response Ilya Kreymer 2025-09-17 20:48:32 -07:00
  • af0c0701b1 keep skipping dupe URLs as before Ilya Kreymer 2025-09-17 20:02:01 -07:00
  • aa8a189c0f add indexer entrypoint: - populate dedup index from remote wacz/multi wacz/multiwacz json Ilya Kreymer 2025-09-17 19:23:32 -07:00
  • f80fded455 args: add separate --dedupIndexUrl to support separate redis for dedup indexing; prep: move WACZLoader to wacz for reuse Ilya Kreymer 2025-09-16 17:48:13 -07:00
  • 94d9a1ea33 dedup work: - resource dedup via page digest - page dedup via page digest check, blocking of dupe page Ilya Kreymer 2025-08-30 12:41:10 -07:00
  • df26169975 Sitemaps: parse /sitemap.xml if no sitemap listed in robots.txt (#933) v1.10.1 Ilya Kreymer 2025-12-11 10:37:37 -08:00
  • 850a6a6665 Don't remove excluded-on-redirect URLs from seen list (#936) Ilya Kreymer 2025-12-08 22:41:52 -08:00
  • 4a703cdc09 sort query args before queuing URLs (#935) Ilya Kreymer 2025-12-08 15:51:50 -08:00
  • 993081d3ee better handling of net::ERR_HTTP_RESPONSE_CODE_FAILURE: (#934) Ilya Kreymer 2025-12-05 16:56:42 -08:00
  • aff3179a3a Merge branch 'add-normalize-url' into temp-dev temp-dev Ilya Kreymer 2025-12-05 09:50:40 -08:00
  • 826342f001 change opts for normalization, such as keeping www. and trailing slashes (see the normalizeUrl sketch after this list) Ilya Kreymer 2025-12-05 09:50:13 -08:00
  • f367f4d31e Merge branch 'sitemap-not-listed-in-robots-fix' into temp-dev Ilya Kreymer 2025-12-05 09:30:34 -08:00
  • c91ccc5148 use normalizeUrl to avoid differently sorted query args Ilya Kreymer 2025-12-05 09:29:41 -08:00
  • 805e2dceaa better handling of net::ERR_HTTP_RESPONSE_CODE_FAILURE: - http headers provided but no payload, record response - record page as failed with status code provided, don't attempt to retry Ilya Kreymer 2025-12-05 09:08:21 -08:00
  • 4c1ee2d2e4 additional logging; resolve relative sitemap urls, e.g. '/sitemap.xml' in robots.txt Ilya Kreymer 2025-12-04 16:03:08 -08:00
  • 42883b1da8 simplify sitemap detection logic: - if robots.txt and sitemap.xml exist but no sitemap is listed in robots, still parse sitemap.xml - simplify detection logic to be able to check both robots and sitemap, or queue a custom url (see the sitemap sketch after this list) Ilya Kreymer 2025-12-04 15:00:54 -08:00
  • 822de93301 version: bump to 1.10.0 v1.10.0 Ilya Kreymer 2025-12-03 14:56:02 -08:00
  • 042acc9c39 version: bump to 1.10.0.beta-2 v1.10.0-beta.2 Ilya Kreymer 2025-12-02 17:00:41 -08:00
  • ff5619e624 Rename robots flag to --useRobots, keep --robots as alias (#932) Tessa Walsh 2025-12-02 18:55:25 -05:00
  • 2914e93152 sitemapper refactor to fix concurrency: (#930) Ilya Kreymer 2025-12-02 15:52:33 -08:00
  • 59df6bbd3f crash page on prompt dialog loop to continue: (#929) Ilya Kreymer 2025-12-01 16:57:00 -08:00
  • 9db0872ecc rebase fix hash-dupe-rebased Ilya Kreymer 2025-11-27 22:41:34 -08:00
  • 7c37672ae9 add removing option to also remove unused crawls if doing a full sync, disabled by default Ilya Kreymer 2025-10-25 15:41:31 -07:00
  • 0d414f72f1 indexer optimize: commit only if added Ilya Kreymer 2025-10-25 13:17:01 -07:00
  • dd8d2e1ea7 rename 'dedup' -> 'dedupe' for consistency Ilya Kreymer 2025-10-25 09:33:37 -07:00
  • c4f07c4e59 always return wacz; store wacz depends only for current wacz; store crawlid depends for entire crawl Ilya Kreymer 2025-10-24 15:01:00 -07:00
  • 9fba5da0ce cleanup, keep compatibility with redis 6; still set to 'post-crawl' state after uploading Ilya Kreymer 2025-10-24 13:24:53 -07:00
  • 6579b2dc95 update to new data model: - hashes stored in separate crawl specific entries, h:<crawlid> - wacz files stored in crawl specific list, c:<crawlid>:wacz - hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set - store filename, crawlId in related.requires list entries for each wacz Ilya Kreymer 2025-10-24 10:38:36 -07:00
  • 298b901558 - track source index for each hash, so entry becomes '<source index> <date> <url>' - entry for source index can contain the crawl id (or possibly wacz and crawl id) - also store dependent sources in relation.requires in datapackage.json - tests: update tests to check for relation.requires Ilya Kreymer 2025-10-17 18:08:38 -07:00
  • 8d53399455 dedup POST requests and non-404s as well! update timestamp after import Ilya Kreymer 2025-09-25 10:40:57 -07:00
  • 78b8847323 use dedup redis to queue up wacz files that need to be updated; use pending queue to support retries in case of failure; store both id and actual URL in case URL changes in subsequent retries Ilya Kreymer 2025-09-22 22:30:08 -07:00
  • ca02f09b5d dedup indexing: strip hash prefix from digest, as cdx does not have it; tests: add index import + dedup crawl to ensure digests match fully Ilya Kreymer 2025-09-22 17:46:19 -07:00
  • db4393c2a1 deps update Ilya Kreymer 2025-09-19 20:54:52 -07:00
  • 0cadf371d0 tests: add dedup-basic.test for simple dedup, ensure number of revisit records === number of response records Ilya Kreymer 2025-09-18 13:10:53 -07:00
  • c447428450 bump to 2.4.7 Ilya Kreymer 2025-09-18 12:17:33 -07:00
  • 2f81798f09 update to latest warcio (2.4.7) to fix issues when returning payload-only size Ilya Kreymer 2025-09-18 02:04:28 -07:00
  • db9e78e823 rename --dedupStoreUrl -> redisDedupUrl; bump version to 1.9.0; fix typo Ilya Kreymer 2025-09-17 23:36:25 -07:00
  • bbe084daa0 warc writing: - update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits - copy additional custom WARC headers to revisit from response Ilya Kreymer 2025-09-17 20:48:32 -07:00
  • 87c94876f6 keep skipping dupe URLs as before Ilya Kreymer 2025-09-17 20:02:01 -07:00
  • 2ecf290d38 add indexer entrypoint: - populate dedup index from remote wacz/multi wacz/multiwacz json Ilya Kreymer 2025-09-17 19:23:32 -07:00
  • eb6b87fbaf args: add separate --dedupIndexUrl to support separate redis for dedup indexing; prep: move WACZLoader to wacz for reuse Ilya Kreymer 2025-09-16 17:48:13 -07:00
  • 00eca5329d dedup work: - resource dedup via page digest - page dedup via page digest check, blocking of dupe page Ilya Kreymer 2025-08-30 12:41:10 -07:00
  • 8e44b31b45 version: bump to 1.10.0-beta.1 v1.10.0-beta.1 Ilya Kreymer 2025-11-27 22:25:11 -08:00
  • 5bb4527de2 (backport for 1.9.3 release) fix connection leaks in aborted fetch() requests (#924) (#925) v1.9.3 1.9.3-release Ilya Kreymer 2025-11-27 21:00:24 -08:00
  • 6a163ddc47 version: 1.9.3 Ilya Kreymer 2025-11-27 20:41:27 -08:00
  • 2ef8e00268 fix connection leaks in aborted fetch() requests (#924) Ilya Kreymer 2025-11-27 20:37:24 -08:00
  • 081272a3f6 robots tweaks: - if redirected to a different site's /robots.txt, cache entry for that site also - deps: bump to wabac.js 2.25.0 robots-cache-redirect Ilya Kreymer 2025-11-27 14:59:25 -08:00
  • 8658df3999 deps: update to browsertrix-behaviors 0.9.7, puppeteer-core 24.31.0 (#922) v1.10.0-beta.0 Ilya Kreymer 2025-11-26 20:12:16 -08:00
  • 30646ca7ba Add downloads dir to cache external dependency within the crawl (#921) Ilya Kreymer 2025-11-26 19:30:27 -08:00
  • 1d15a155f2 Add option to respect robots.txt disallows (#888) Tessa Walsh 2025-11-26 22:00:06 -05:00
  • 75a0c9a305 version: bump to 1.10.0-beta.0 Ilya Kreymer 2025-11-26 15:15:45 -08:00
  • 9cd2d393bc Fix typo 'runInIframes' (#918) hexagonwin 2025-11-26 12:19:01 +09:00
  • b9b804e660 improvements to support pausing: (#919) Ilya Kreymer 2025-11-25 19:17:39 -08:00
  • 8595bcebc1 add new logger.interrupt(), which will interrupt and exit the crawl but not fail it, unlike logger.fatal(); replace some logger.fatal() calls with interrupts to allow retries instead of immediate failure, esp. when external inputs (profile, behaviors) cannot be downloaded (see the interrupt sketch after this list) logger-interrupt Ilya Kreymer 2025-11-25 07:58:30 -08:00
  • de254064f8 update tiktok-better-captcha-check Ilya Kreymer 2025-11-21 14:05:14 -08:00
  • 764b67b0b3 update Ilya Kreymer 2025-11-21 13:45:59 -08:00
  • 791afa3413 update behaviors Ilya Kreymer 2025-11-21 13:20:37 -08:00
  • dc60b3dccd bump profile download timeout Ilya Kreymer 2025-11-20 14:41:01 -08:00
  • b74147a1ec update Ilya Kreymer 2025-11-20 13:19:13 -08:00
  • b98aa8aea1 update behaviors Ilya Kreymer 2025-11-20 13:04:52 -08:00
  • 510f81ad45 Update SaveState type issue-897-seedfile-expiration Tessa Walsh 2025-11-19 19:15:28 -05:00
  • 565ba54454 better failure detection, allow update support for captcha detection via behaviors (#917) v1.9.2 Ilya Kreymer 2025-11-19 15:49:49 -08:00
  • 0c7e2ce37e Tweak test, order in finished doesn't matter Tessa Walsh 2025-11-19 18:15:13 -05:00
  • 3b87f11286 Make sure seed file isn't re-downloaded in test Tessa Walsh 2025-11-19 18:04:23 -05:00
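
Implementation sketches

The sketches below illustrate mechanisms described in the commit messages above. They are minimal, hypothetical reconstructions, not the crawler's actual code; unless stated otherwise, all type, field, and helper names are assumptions.

Page-size sketch (2c8d22f76a): derive the page-level 'size' written to pages.jsonl as the sum of the per-resource 'size' entries in the page's urn:pageinfo record.

```ts
// Hypothetical shapes for a urn:pageinfo record; field names are
// illustrative assumptions.
interface PageInfoResource {
  status: number;
  mime: string;
  size?: number; // per-resource size added in this commit
}

interface PageInfo {
  url: string;
  resources: Record<string, PageInfoResource>;
}

// Page-level 'size' for pages.jsonl: sum of all resource sizes listed
// in the page's urn:pageinfo record.
function computePageSize(pageInfo: PageInfo): number {
  return Object.values(pageInfo.resources).reduce(
    (total, res) => total + (res.size ?? 0),
    0,
  );
}
```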
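
savedSize sketch (d3a7290d8c): savedSize is the difference between the original and revisit WARC record sizes. When the indexer sees a revisit before its original, the revisit's size is held back and subtracted once the original arrives. Class and method names are hypothetical.

```ts
// Tracks bytes saved by dedupe: for each duplicate written as a revisit,
// savedSize grows by (original record size - revisit record size).
class SavedSizeTracker {
  private savedSize = 0;
  // revisit sizes seen before their original record, keyed by payload hash
  private pendingRevisits = new Map<string, number[]>();
  private originals = new Map<string, number>();

  addOriginal(hash: string, size: number): void {
    this.originals.set(hash, size);
    // settle any revisits that were indexed before this original
    for (const revisitSize of this.pendingRevisits.get(hash) ?? []) {
      this.savedSize += size - revisitSize;
    }
    this.pendingRevisits.delete(hash);
  }

  addRevisit(hash: string, revisitSize: number): void {
    const originalSize = this.originals.get(hash);
    if (originalSize !== undefined) {
      this.savedSize += originalSize - revisitSize;
    } else {
      // original not seen yet: hold this revisit until it arrives
      const pending = this.pendingRevisits.get(hash) ?? [];
      pending.push(revisitSize);
      this.pendingRevisits.set(hash, pending);
    }
  }

  get total(): number {
    return this.savedSize;
  }
}
```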
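
Key-layout sketch (e0244391f1): the per-crawl Redis layout described in the data-model commit, shown with ioredis. Value formats are assumptions.

```ts
import Redis from "ioredis";

// Per-crawl hashes live in h:<crawlid> while a crawl runs, and its wacz
// files in the list c:<crawlid>:wacz.
async function addWacz(redis: Redis, crawlId: string, filename: string) {
  await redis.rpush(`c:${crawlId}:wacz`, filename);
}

// On completion, the crawl's hashes are committed to the shared
// 'alldupes' hashset and the crawl id is added to the 'allcrawls' set.
async function commitCrawl(redis: Redis, crawlId: string) {
  const hashes = await redis.hgetall(`h:${crawlId}`);
  if (Object.keys(hashes).length > 0) {
    await redis.hset("alldupes", hashes);
  }
  await redis.sadd("allcrawls", crawlId);
}
```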
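
Digest sketch (81d7848a79): CDX digests carry no algorithm prefix, so a WARC digest such as 'sha256:<hex>' must be stripped before it is used as an index key.

```ts
// Strip an algorithm prefix (e.g. "sha256:") from a WARC digest so it
// matches the bare digest form used in CDX.
function stripHashPrefix(digest: string): string {
  const idx = digest.indexOf(":");
  return idx >= 0 ? digest.slice(idx + 1) : digest;
}

// truncated digest value, for illustration only:
stripHashPrefix("sha256:4f2a9c"); // => "4f2a9c"
```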
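
normalizeUrl sketch (826342f001, c91ccc5148): sort query args so differently-ordered parameters dedupe to the same URL, while keeping 'www.' and trailing slashes. The option names below come from the normalize-url npm package; that the crawler uses this package with exactly these options is an assumption.

```ts
import normalizeUrl from "normalize-url";

// keep "www." and trailing slashes, but sort query parameters
const opts = {
  stripWWW: false,
  removeTrailingSlash: false,
  sortQueryParameters: true,
};

normalizeUrl("https://www.example.com/page/?b=2&a=1", opts);
// => "https://www.example.com/page/?a=1&b=2"
```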
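
Sitemap sketch (42883b1da8, 4c1ee2d2e4): use an explicit sitemap URL when given; otherwise take Sitemap: lines from robots.txt, resolving relative entries such as '/sitemap.xml'; otherwise fall back to the conventional /sitemap.xml even when robots.txt lists nothing. Helper names are illustrative.

```ts
// Resolve the sitemap URL(s) to parse for a site (sketch).
async function resolveSitemaps(
  origin: string,
  customSitemap?: string,
): Promise<string[]> {
  if (customSitemap) {
    // an explicitly queued sitemap URL always wins
    return [new URL(customSitemap, origin).href];
  }
  const fromRobots = await sitemapsFromRobots(
    new URL("/robots.txt", origin).href,
  );
  if (fromRobots.length > 0) {
    // resolve relative entries such as "/sitemap.xml" against the origin
    return fromRobots.map((u) => new URL(u, origin).href);
  }
  // nothing listed in robots.txt: still try the conventional location
  return [new URL("/sitemap.xml", origin).href];
}

// assumed helper: fetch robots.txt and return its "Sitemap:" lines
async function sitemapsFromRobots(robotsUrl: string): Promise<string[]> {
  const resp = await fetch(robotsUrl);
  if (!resp.ok) return [];
  return (await resp.text())
    .split("\n")
    .filter((line) => line.toLowerCase().startsWith("sitemap:"))
    .map((line) => line.slice("sitemap:".length).trim());
}
```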
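
Interrupt sketch (8595bcebc1): both calls stop the crawl, but interrupt() exits with a retryable status instead of marking the crawl failed, which suits transient problems such as an undownloadable profile or behavior file. The Logger shape and exit codes here are hypothetical.

```ts
import process from "node:process";

class Logger {
  // fatal(): stop the crawl and mark it failed
  fatal(msg: string): never {
    console.error(`FATAL: ${msg}`);
    process.exit(17); // hypothetical "failed" exit code
  }

  // interrupt(): stop the crawl without failing it, so it can be retried
  interrupt(msg: string): never {
    console.error(`INTERRUPT: ${msg}`);
    process.exit(11); // hypothetical "interrupted, retry" exit code
  }
}
```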