

Poster: aibek Date: Jan 30, 2014 7:27pm
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

The digest isn't the SHA-1 hash. For the above linked Google logo gif, the SHA-1 hash is fd852df5478eb7eb9410ee9101bb364adf487fb0, and none of the digests recorded on the CDX page is this. There must be ways to get the original unmodified page. (I don't know any, though.)

Contrariwise, (i) the Wayback Machine almost certainly modifies the links on the fly, and (ii) the saved webpages can be expected to differ in all imaginable and unimaginable ways. Therefore the code doing the modification must be trivial, as it has to work on every saved webpage without breaking it. I am willing to bet that it does nothing more than insert a few lines and prefix every URL with web.archive.org/web/TIMESTAMP. So, after comparing a few pages and identifying exactly what is being changed, you could be certain -- for all practical purposes -- that by undoing those changes (i.e., deleting the inserted lines and stripping the URL prefixes) you are getting the original pages back. (In case you go this route, please post what you learn on the forum too, as it would be useful to others.)

Also, note that you may not have to check the digests for all the saved pages. If you are willing to assume that the same digest implies the same page (even though we already know that the same page does not imply the same digest), you could proceed with the CDX collapse-based-on-digest result -- i.e., exactly where you started. That is to say, the CDX collapse result will already have removed adjacent records with the same digest; you could use that list as your master list.
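
For what it's worth, here is a rough Python sketch of pulling that collapsed master list from the CDX server; the target URL is just a placeholder, substitute the page you are actually checking:

    import json
    import urllib.parse
    import urllib.request

    # Placeholder target -- substitute the page (or image) you are checking.
    target = "http://www.google.com/images/logo.gif"

    query = (
        "http://web.archive.org/cdx/search/cdx?url="
        + urllib.parse.quote(target, safe="")
        + "&output=json&collapse=digest&fl=timestamp,digest,length"
    )

    with urllib.request.urlopen(query) as resp:
        rows = json.load(resp)

    # First row is the field names; the rest are captures, already collapsed
    # on adjacent identical digests.
    for timestamp, digest, length in rows[1:]:
        print(timestamp, digest, length)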
This post was modified by aibek on 2014-01-31 03:27:04


Poster: Aleitheia Date: Aug 1, 2022 5:58pm
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

Pardon me for exhuming this very interesting discussion on a subject I've recently been looking into... some random observations/ideas; I'd like to hear others' thoughts on these:

1. As far as I have been able to find out, the CDX digest is supposed to be a base32 representation of the SHA1 hash of something called the "response payload" of a record in an ARC/WARC file (see http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/). The SHA1 of the google logo gif listed above (fd852df5478eb7eb9410ee9101bb364adf487fb0) is the hex representation, so it's no surprise it does not match any of the digests. However, re-encoding that same hash with the standard RFC 4648 base32 alphabet (A-Z, 2-7) gives 7WCS35KHR236XFAQ52IQDOZWJLPUQ75Q, which is exactly the most common digest in the CDX listing -- so for those captures the "response payload" appears to be the gif file alone (see the snippet after this list).
2. That still leaves the question of where the rarer, deviating digest comes from. It cannot be the header being hashed in as well, because that would produce much more variation from the changing dates and timestamps. However, when comparing the headers of the different snapshots of the google gif, I noticed that the captures with the common 7WCS35KHR236XFAQ52IQDOZWJLPUQ75Q digest list a lot of different files/paths as x-archive-src, while the much rarer captures with digest ISYUH57EGD664SU7S4HQK77WWRXD4M73 always seem to come from an x-archive-src that starts with wb_urls.ia* (like wb_urls.ia14226.20050306173129-c/wb_urls.ia14226.20050414052232.arc.gz) - so the irregularity seems to be caused by the software used for creating/processing these archive files.
3. At the webrecorder github there's an interesting discussion on the subject of inconsistencies caused by WARC payload digest calculation with or without HTTP transfer encoding (https://github.com/webrecorder/warcio/issues/74) - maybe something similar has been in play here, causing the discrepancies in the wayback machine's CDX digests?
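
To illustrate point 1, here's a minimal Python sketch of the conversion -- nothing Wayback-specific, just the standard library; the hex value is the one quoted above:

    import base64
    import hashlib

    # Hex SHA-1 quoted above, re-encoded in RFC 4648 base32 (the WARC/CDX form).
    hex_sha1 = "fd852df5478eb7eb9410ee9101bb364adf487fb0"
    print(base64.b32encode(bytes.fromhex(hex_sha1)).decode("ascii"))
    # -> 7WCS35KHR236XFAQ52IQDOZWJLPUQ75Q

    # And how such a digest would be computed from raw payload bytes:
    def payload_digest(payload: bytes) -> str:
        return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")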

P.S.: To get unaltered content from the Wayback Machine, there's a nice undocumented feature mentioned at https://wiki.archiveteam.org/index.php/Restoring: simply add "id_" after the timestamp in the URL!
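
A rough sketch of combining the two: fetch a capture through such an "id_" URL and compare its digest against the CDX listing (the timestamp and target URL below are placeholders, not values from this thread):

    import base64
    import hashlib
    import urllib.request

    # Placeholders only -- substitute a real capture timestamp and URL.
    timestamp = "20050414052232"
    target = "http://www.google.com/images/logo.gif"
    capture = f"https://web.archive.org/web/{timestamp}id_/{target}"

    # "id_" after the timestamp asks the Wayback Machine for the unaltered payload.
    with urllib.request.urlopen(capture) as resp:
        payload = resp.read()

    digest = base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    print(digest)  # compare against the digest column in the CDX output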
This post was modified by Aleitheia on 2022-08-02 00:58:13