There’s a very interesting new report out on journal article mining, prepared by Eefke Smit and Maurits van der Graaf on behalf of the Publishing Research Consortium. It thus has a strong publisher perspective, but as far as I know it’s the first extensive look at the issues involved in practical, operational, large-scale data mining of the journal literature. One of the really interesting things that emerges from the report, at least as I read it, is that many of the commercial publishers seem to be thinking about literature mining as a separate activity, not included in the traditional electronic subscription arrangements (site licenses) that they have with research libraries. Indeed, many such licenses forbid bulk downloading of journal articles, which, in the absence of text-mining facilities built into the vendor platforms, is a prerequisite for such mining; and even where such facilities do exist, they essentially mean that the publishers control the evolution of mining technology. Rather, the publishers seem to envision a future where they’ll do business directly with potential literature miners.
This is one of several issues framed by the report that I think merit very careful thought by research library leaders, along with broad conversations engaging faculty.
The report is at:
http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf
and there is an accompanying press release at
http://www.publishingresearch.net/Media_page.htm
Disclosure: I was one of the many people interviewed for this study, presumably at least in part because of my 2006 paper on open computation.
Clifford Lynch
Director, CNI