Ronald Snyder
Director of Advanced Technologies
Ithaka – JSTOR
In the 15 years that JSTOR has existed, it has generated and archived a wealth of logging data representing many billions of user actions. Until recently, these usage data were used mainly to generate summary-level institution and publisher reports; their sheer volume and complexity made multidimensional, longitudinal analysis impractical. Over the last year, Ithaka has made a significant investment in normalizing and organizing these data in the interest of better understanding user behaviors and trends in the consumption of academic materials.
This presentation will discuss the technological approach Ithaka has taken to the data volume and complexity issues, including Big Data challenges such as storage, processing, and analysis. It will share experiences from the original attempt to build this data warehouse using traditional relational database technologies, and from the decision to abandon that approach in favor of a solution based on the open source Hadoop infrastructure. Hadoop provides a robust, scalable, and cost-effective solution for managing Ithaka’s big data. Ithaka has combined Hadoop with an open source indexing technology (Lucene/Solr) and some custom-built software to provide a Web-based tool for the interactive exploration of this rich data set. The presentation will also include some top-level observations on user behaviors and on content discovery and consumption trends that have been identified using these tools.
Handout (PDF)