Ryan Cordell
Associate Professor of English
Northeastern University
This talk will outline the primary findings and recommendations of a report written for The Andrew W. Mellon Foundation that seeks to describe the current state of optical character recognition (OCR) for large-scale humanities collections and suggest the most fruitful avenues for future research in this domain. The report surveys the current state of OCR for historical documents and recommends concrete steps that researchers, implementers, and funders can take to improve the quality and use of OCR collections over the next five to ten years. We find, for instance, that advances in artificial intelligence for image recognition, natural language processing, and machine learning will drive significant progress in this area. More importantly, however, we describe how sharing goals, techniques, and data among researchers in computer science, in book and manuscript studies, and in library and information sciences will open up exciting new problems and allow a broad community, including cohorts who rarely collaborate, to allocate resources and measure progress in improving OCR for historical typography and multilingual documents. The presentation will briefly outline the report’s findings about the current state of the art for humanistic OCR, but will devote the majority of its time to detailing the report’s nine primary recommendations for future, collaborative OCR research.