Commit Graph

4 Commits

Author SHA1 Message Date
Ilya Kreymer
e0244391f1 update to new data model:
- hashes stored in separate crawl specific entries, h:<crawlid>
- wacz files stored in crawl specific list, c:<crawlid>:wacz
- hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set
- store filename, crawlId in related.requires list entries for each wacz
2025-12-11 10:43:57 -08:00
Ilya Kreymer
d620e21991 - track source index for each hash, so entry becomes '<source index> <date> <url>'
- entry for source index can contain the crawl id (or possibly wacz and crawl id)
- also store dependent sources in relation.requires in datapackage.json
- tests: update tests to check for relation.requires
2025-12-11 10:42:57 -08:00
Ilya Kreymer
81d7848a79 dedup indexing: strip hash prefix from digest, as cdx does not have it
tests: add index import + dedup crawl to ensure digests match fully
2025-12-11 10:42:57 -08:00
Ilya Kreymer
94ac058488 tests: add dedup-basic.test for simple dedup, ensure number of revisit records === number of response records
2025-12-11 10:42:57 -08:00
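
The key layout named in these commit messages (h:<crawlid>, c:<crawlid>:wacz, 'alldupes', 'allcrawls') can be illustrated with a small in-memory sketch. This is not the project's code: the class and method names are hypothetical, plain Python containers stand in for the key-value store, and two points are assumptions made for illustration only: that the source index is the position of a WACZ in the crawl's list, and that the first entry written for a digest is kept.

```python
from collections import defaultdict


class DedupIndex:
    """Hypothetical in-memory stand-in for the key layout in the commit messages."""

    def __init__(self):
        self.hashes = defaultdict(dict)  # "h:<crawlid>" and "alldupes" -> {digest: entry}
        self.lists = defaultdict(list)   # "c:<crawlid>:wacz" -> [wacz filename, ...]
        self.sets = defaultdict(set)     # "allcrawls" -> {crawl id, ...}

    @staticmethod
    def _strip_prefix(digest):
        # Drop a "sha256:"-style prefix, since CDX digests do not carry one.
        return digest.split(":", 1)[-1]

    def add_wacz(self, crawl_id, filename):
        """Append a WACZ filename to the crawl's list; return its position.

        Using the list position as the source index is an assumption.
        """
        key = f"c:{crawl_id}:wacz"
        self.lists[key].append(filename)
        return len(self.lists[key]) - 1

    def add_hash(self, crawl_id, digest, source_index, date, url):
        """Store '<source index> <date> <url>' under the crawl-specific hash.

        Keeping the first entry per digest is an assumption.
        """
        entry = f"{source_index} {date} {url}"
        self.hashes[f"h:{crawl_id}"].setdefault(self._strip_prefix(digest), entry)

    def commit_crawl(self, crawl_id):
        """When a crawl completes, copy its hashes into 'alldupes' and
        add the crawl id to 'allcrawls'."""
        for digest, entry in self.hashes[f"h:{crawl_id}"].items():
            self.hashes["alldupes"].setdefault(digest, entry)
        self.sets["allcrawls"].add(crawl_id)

    def lookup(self, digest):
        """Return the committed entry for a digest, or None if not committed."""
        return self.hashes["alldupes"].get(self._strip_prefix(digest))
```

In this sketch a digest recorded during a crawl only becomes visible through lookup() after commit_crawl() has run, which mirrors the "committed when crawl is complete" wording in e0244391f1.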