Kris Carpenter Negulescu, Director, Web Archive, Internet Archive
For the past two years, the Internet Archive (IA) has used the Nutch/Lucene open source tools to generate full-text search indexes of archival Web content for the National Library of Australia, the Bibliothèque Nationale de France, the Library of Congress, and the National Archives and Records Administration (NARA). The IA has also produced an experimental search service for consumers of its own historic Web collections, The 20th Century Find, which encompasses content harvested from 1996 to 1999.
This presentation will review each of these case studies, including lessons learned to date from searching Web archives at a scale of 100 million to more than 1 billion URLs, and the challenges of searching across time, both the technical ones and those specific to an end user’s experience of a Web archive. The case studies are specific to Nutch/Lucene implementations, but the primary focus of the presentation will be the implications for searching archives in general.
http://www.archive.org/
http://lucene.apache.org/nutch/