
Poster: Zarkoff Date: Jan 24, 2014 2:13am
Forum: forums Subject: CDX digest not accurately capturing duplicates?

I'm having trouble using a CDX query to identify duplicate pages over a long time period.

It suggests to me that I cannot rely on the CDX digest to do what it seems to be designed to do.

I have been running a query to deliver fields including the digest hash. I've then been using the digest to collapse all duplicates and therefore identify only distinct pages at a given url over the period.
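
For concreteness, the query I'm running looks roughly like this (a Python sketch; the url here is only a placeholder, not my actual target, and any HTTP client would do):

import requests

# Placeholder url -- substitute the page actually being studied.
resp = requests.get("http://web.archive.org/cdx/search/cdx", params={
    "url": "example.com/page.html",
    "output": "json",
    "fl": "timestamp,digest",
    "collapse": "digest",   # drop rows whose digest matches the previous row
})
rows = resp.json()[1:]      # first row of the JSON output is the header
for timestamp, digest in rows:
    print(timestamp, digest)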

But the CDX query is giving different digests for identical pages at identical urls on different dates. It even gives different digests for blank pages - cases where, for some reason, the page in the archive is simply blank (no error, just nothing).

Can this be right? Have I misunderstood what the digest is and what it does?

Does this also mean that the CDX API's own collapse function is unreliable at removing adjacent duplicates?

And does it mean I will have to download every instance of a page and hash it myself to remove duplicates?


Poster: aibek Date: Jan 25, 2014 2:31am
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

You are right. The Wayback Machine records more than one digest for the same file.

The gif image in the query linked below has apparently stayed the same for the last 13 years. On that CDX query page, the digests for almost all of the records are the same (7WCS…), but some records have a different digest (ISYU…). The latter files, however, are exactly the same as the former files.

Either there is a bug (the bug, however, works in a consistent manner!), or Zarkoff and I have misunderstood what the digest represents.

http://web.archive.org/cdx/search/cdx?url=google.com/intl/en/images/logo.gif&collapse=timestamp:10
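
For anyone who wants to reproduce the count, a rough Python sketch (same url as the query above; the field list is just what seemed most useful):

import collections
import requests

# Count how many records carry each digest for this url.
resp = requests.get("http://web.archive.org/cdx/search/cdx", params={
    "url": "google.com/intl/en/images/logo.gif",
    "output": "json",
    "fl": "timestamp,digest",
})
rows = resp.json()[1:]   # skip the header row
counts = collections.Counter(digest for _, digest in rows)
for digest, n in counts.most_common():
    print(n, digest)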


Poster: aibek Date: Jan 25, 2014 2:59am
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

> And does it mean I will have to download every instance of a page and hash it myself to remove duplicates?

As you may have guessed, downloading all instances of a webpage and hashing them yourself would be worse than relying on the CDX digest. That is because all the instances of the webpage are guaranteed to differ: the Wayback Machine rewrites every link as an internal archive link, those urls contain timestamps, and the timestamps obviously differ. You could, however, try identifying all these internal links and deleting them before computing the hash. Perhaps simply deleting the http://web.archive.org/web/TIMESTAMP/ part of every url would do.
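
As a rough sketch of that idea (the regex is only a guess at the shape of the rewritten urls, and you would still have to account for any extra markup the Wayback Machine inserts into the page):

import hashlib
import re

# Strip the Wayback prefix from rewritten urls before hashing, so that two
# captures of the same page compare equal. The pattern assumes a 14-digit
# timestamp optionally followed by a short modifier such as im_ or js_;
# adjust it after inspecting a few saved pages.
WAYBACK_PREFIX = re.compile(r"https?://web\.archive\.org/web/\d{14}(?:[a-z]{2}_)?/")

def normalized_hash(html: str) -> str:
    cleaned = WAYBACK_PREFIX.sub("", html)
    return hashlib.sha1(cleaned.encode("utf-8")).hexdigest()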
This post was modified by aibek on 2014-01-25 10:59:00


Poster: aibek Date: Jan 24, 2014 9:20pm
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

> Does this also mean that the CDX API's own collapse function is unreliable at removing adjacent duplicates?

According to your report, it is not the collapse function that is at fault, but the fact that different digests are being recorded for the same page.

Why the same pages get different digests is an interesting question! (if true; I will check.) Two observations, though:
(i) When I last looked into this, I could not find out how the Wayback Machine computes its digests. It does not seem to be a straightforward digest of the downloaded file - at least, not a straightforward computation using the 20 or so most popular algorithms (md5, sha1, etc.). (A sketch of the kind of check I mean is at the end of this post.)
(ii) I am not sure how the Wayback Machine stores the webpages, but I assume that only one copy per digest is saved (to save storage space). If that assumption is true, and if it is also true that different digests are being created for exactly the same content, then this is a major bug, as terabytes of space on the IA servers are being wasted.
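
For anyone who wants to repeat the experiment mentioned in (i), this is roughly the check I mean - a Python sketch, assuming the variable payload already holds the raw bytes of one downloaded capture:

import base64
import hashlib

# Hash the downloaded bytes with a few common algorithms, in both hex and
# base32, and compare the results against the digest reported by the CDX server.
def candidate_digests(payload: bytes) -> dict:
    out = {}
    for name in ("md5", "sha1", "sha256"):
        h = hashlib.new(name, payload).digest()
        out[name + "/hex"] = h.hex()
        out[name + "/base32"] = base64.b32encode(h).decode("ascii")
    return out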


Poster: Zarkoff Date: Jan 29, 2014 9:06am
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

Such relief to hear from someone interested in the same questions, aibek.

The CDX digest is an SHA-1 hash according to this:

http://crawler.archive.org/apidocs/constant-values.html#org.archive.io.ArchiveFileConstants.CDX

Your suggestions for removing the Wayback alterations to archived pages are very useful. I wonder whether the Wayback Machine provides a means of requesting pages so that they are delivered with only their original attributes?

Thank you for confirming the apparent error. The CDX documentation doesn't say explicitly what the digest is, though it looks very much like a unique identifier, given that the CDX server uses it to collapse adjacent duplicates and that it is documented as an SHA-1 hash.


Poster: aibek Date: Jan 30, 2014 7:27pm
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

The digest isn't the SHA-1 hash. For the Google logo gif linked above, the SHA-1 hash is fd852df5478eb7eb9410ee9101bb364adf487fb0, and none of the digests recorded on the CDX page matches it.

There must be ways to get the original unmodified page. (I don't know any, though.) Contrariwise, (i) the Wayback Machine almost certainly modifies the links on the fly, and (ii) the saved webpages can be expected to differ in all imaginable and unimaginable ways. The code doing the modification must therefore be trivial, since it has to work on every saved webpage without breaking it. I am willing to bet that it does nothing more than inserting a few lines and prefixing every url with web.archive.org/web/TIMESTAMP. So, after comparing a few pages and identifying exactly what is being changed, you could be certain -- for all practical purposes -- that by undoing those two additions you are getting the original pages back. (In case you go this route, please post what you learn on the forum too, as it would be useful to others.)

Also, note that you may not have to check the digests for all the saved pages. If you are willing to assume that the same digest implies the same page (even though we already know that the same page does not imply the same digest), you could proceed with the CDX collapse-based-on-digest result -- i.e., exactly where you started. That is to say, the CDX collapse result will already have removed the adjacent records with the same digest; you could use that list as your master list.
This post was modified by aibek on 2014-01-31 03:27:04


Poster: Aleitheia Date: Aug 1, 2022 5:58pm
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

Pardon me for exhuming this very interesting discussion on a subject I've recently been looking into... some random observations/ideas; I would like to hear others' thoughts on this:

1. As far as I have been able to find out, the CDX digest is supposed to be a base32 representation of the SHA-1 hash of something called the "response payload" of a record in an ARC/WARC file (see http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/). The SHA-1 of the Google logo gif listed above (fd852df5478eb7eb9410ee9101bb364adf487fb0) is the hex representation, so it's no surprise that it doesn't match any of the digests. However, the base32 representation of the hash is zp2jvxa7htvyq50gxt8g3esp9bfmgzxg, and this doesn't match either, so the "response payload" must include something else besides the gif file alone...
2. Whatever this "something else" may be, it cannot be the header, because otherwise there would be much more variation caused by the changing dates and timestamps. However, when comparing the headers of the different snapshots of the Google gif, I noticed that the more common 7WCS35KHR236XFAQ52IQDOZWJLPUQ75Q digests list a lot of different files/paths as x-archive-src, while the much rarer captures that list ISYUH57EGD664SU7S4HQK77WWRXD4M73 as their digest always seem to come from an x-archive-src that starts with wb_urls.ia* (like wb_urls.ia14226.20050306173129-c/wb_urls.ia14226.20050414052232.arc.gz) - so the irregularity seems to be caused by the software used to create/process these archive files.
3. At the webrecorder github there's an interesting discussion on the subject of inconsistencies caused by WARC payload digest calculation with or without HTTP transfer encoding (https://github.com/webrecorder/warcio/issues/74) - maybe something similar has been in play here, causing the discrepancies in the wayback machine's CDX digests?

P.S.: To get unaltered content from the Wayback Machine, there's a nice undocumented feature mentioned at https://wiki.archiveteam.org/index.php/Restoring: simply add "id_" after the timestamp in the url!
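
Putting 1. and the P.S. together, here's a quick sanity check one could run (a Python sketch; I haven't verified it against every capture):

import base64
import hashlib
import requests

# Take one capture of the Google logo from the CDX server, fetch its payload
# with the "id_" modifier (unaltered content), and compare base32(SHA-1(payload))
# against the digest column the CDX server reports.
cdx = requests.get("http://web.archive.org/cdx/search/cdx", params={
    "url": "google.com/intl/en/images/logo.gif",
    "output": "json",
    "fl": "timestamp,original,digest",
    "limit": "1",
}).json()
timestamp, original, reported = cdx[1]          # cdx[0] is the header row
raw = requests.get(f"http://web.archive.org/web/{timestamp}id_/{original}").content
computed = base64.b32encode(hashlib.sha1(raw).digest()).decode("ascii")
print(reported, computed, reported == computed)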
This post was modified by Aleitheia on 2022-08-02 00:58:13