Michael L. Nelson
Professor of Computer Science
Old Dominion University
As the number of public web archives grows, so does our interest in verifying the integrity of the archived web pages they replay. When web archives disagree on the replay of a web page, we are unsure how to resolve the discrepancy. Adapting Segal’s law to web archives: “The person with an archive knows what the page looked like. The person with two archives is never sure.” At first glance, a distributed public ledger such as a blockchain would seem like a good way to detect damage to, or tampering with, archived web pages: third parties could replay the pages and store their cryptographic hashes and timestamps in the blockchain. However, by continuously replaying over 17,000 web pages sampled from 20 different public web archives over the course of one year, we have found that approximately 75% of the replayed web pages underwent some kind of change that would cause them not to hash to the same value. Some changes are significant and affect the semantics of the page itself, but most would not be noticed by regular users. Nonetheless, if blockchain or other hash-based techniques were used to detect tampering, the number of false positives generated by the normal operation of web archives would make detecting actual tampering almost impossible. We review the different kinds of changes with examples drawn from each of the 20 public web archives.
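
To illustrate why byte-level hashing is so sensitive to harmless replay differences, here is a minimal Python sketch. It is not the method used in the study: the function names (`fingerprint`, `same_content`) and the two HTML snippets are hypothetical, invented for illustration only. It records a SHA-256 digest and a timestamp for each replay, as a third party might before writing them to a ledger, and then compares two replays that differ only by an HTML comment a reader would never see.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(replayed_bytes: bytes, observed_at: datetime) -> dict:
    """Return the digest and timestamp a third party might record for one replay."""
    return {
        "sha256": hashlib.sha256(replayed_bytes).hexdigest(),
        "observed_at": observed_at.isoformat(),
    }


def same_content(a: dict, b: dict) -> bool:
    """Two replays 'match' only if their digests are byte-for-byte identical."""
    return a["sha256"] == b["sha256"]


# Hypothetical replays of the same archived page on two different days.
# The second differs only by an invisible comment added at replay time.
replay_day_1 = b"<html><body><p>Archived story text.</p></body></html>"
replay_day_2 = (
    b"<html><body><p>Archived story text.</p>"
    b"<!-- served by replay system --></body></html>"
)

first = fingerprint(replay_day_1, datetime(2020, 1, 1, tzinfo=timezone.utc))
second = fingerprint(replay_day_2, datetime(2020, 1, 2, tzinfo=timezone.utc))

if same_content(first, second):
    print("digests match")
else:
    print("digests differ: flagged as possible tampering")
```

In this sketch the visible text is unchanged, yet the digests differ, so a hash-only check would report the second replay as altered. This is the kind of false positive described above.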