Peter Broadwell
Manager of AI Modeling and Inference in Research Data Services
Stanford University
Lindsay King
Head Librarian, Bowes Art & Architecture Library
Stanford University
Neural network artificial intelligence (AI) technologies capable of working with both images and text offer promising tools for improving access to library collections at scale. In particular, libraries increasingly must generate succinct “alt-text” descriptions of their digital images, a remediation obligation that can span tens of thousands of items. AI approaches are appealing because they can automate complex tasks involving natural language, but there are good reasons to look beyond simply pasting library materials into ChatGPT. Stanford University’s experiments have found that both fine-tuning locally hosted models and “conditioning” the captions by incorporating available metadata into the model’s instructions (“prompt engineering”) show promise for producing useful descriptive text for images. The experiments have also shown that tailoring approaches to specific collections and keeping human reviewers in the loop are key to making the alt-text as accurate as possible while gaining efficiency at scale. Beyond accessibility compliance, vision-language models can also enable free-text “evocative” search in multiple languages, object detection, and other tools for improving discovery within image collections.
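As a rough illustration of the metadata-conditioning idea, the sketch below folds catalog fields into the prompt sent to a locally hosted vision-language model behind an OpenAI-compatible endpoint (as provided by servers such as vLLM or Ollama). The endpoint URL, model name, metadata fields, and helper function are hypothetical placeholders, not details of the Stanford pipeline.

```python
"""Hedged sketch: metadata-conditioned alt-text generation.

Assumes a locally hosted vision-language model served behind an
OpenAI-compatible API; all names below are illustrative.
"""
import base64
from openai import OpenAI

# Placeholder local endpoint; a self-hosted server needs no real key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")


def generate_alt_text(image_path: str, metadata: dict) -> str:
    # Fold available catalog metadata into the instructions so the
    # model "conditions" its caption on what the library already knows.
    context = "; ".join(f"{k}: {v}" for k, v in metadata.items() if v)
    prompt = (
        "Write one concise alt-text sentence (under 125 characters) "
        "describing this image for a screen-reader user. "
        f"Known catalog metadata: {context}. "
        "Describe only what is visible; do not repeat the metadata verbatim."
    )
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


# Drafts like this would still go to a human reviewer before publication.
draft = generate_alt_text(
    "scans/example-postcard.jpg",  # hypothetical file path
    {"title": "View of Palo Alto, ca. 1905", "medium": "postcard"},
)
print(draft)
```

Keeping the metadata in the prompt rather than in the output constraint lets the model ground its description in catalog knowledge while the instruction still steers it toward describing the visible image, which is the point of conditioning; human review remains the final quality gate.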