Jian Wu
Assistant Professor
Old Dominion University
Edward Fox
Professor
Virginia Polytechnic Institute and State University
Funded by the Institute of Museum and Library Services, Virginia Tech and Old Dominion University are collaborating on a project aimed at bringing computational access to book-length documents, demonstrated with electronic theses and dissertations (ETDs). Since the project launch, the team has made substantial progress on data acquisition, information extraction, and classification.

The team has collected the largest corpus of ETDs to date, containing about 500,000 full-text documents along with their metadata. The collection was built by crawling the institutional ETD repositories of university libraries in the United States, while honoring the crawling policies of the target websites.

To build robust text representations for downstream tasks, we investigated a language model specific to ETDs. This model, called ETDBERT, was built by fine-tuning Bidirectional Encoder Representations from Transformers (BERT) on a corpus of 300 million tokens extracted from a subset of the collected ETDs spanning 195 disciplines. ETDBERT was evaluated with both intrinsic and extrinsic metrics and outperformed traditional text representations on a subject domain classification task. Whereas SciBERT was trained on a single Tensor Processing Unit (TPU) for seven days, ETDBERT required far fewer resources to train while achieving comparable performance on the same classification task; we attribute this to the multi-disciplinary sampling of our training corpus.

A planned, further improved language model will support tasks such as novelty measurement, automatic subject categorization, and long-text summarization, helping us better understand the nuances of knowledge in ETDs and provide robust, scalable services around them.
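To make the policy-honoring crawling step concrete, here is a minimal sketch of a robots.txt-aware fetch using only the Python standard library. The user-agent name, default delay, and example URL are illustrative assumptions, not the project's actual harvesting code.

```python
import time
import urllib.parse
import urllib.request
import urllib.robotparser

def fetch_if_allowed(url, user_agent="etd-crawler", default_delay=5.0):
    """Fetch a repository page only if its robots.txt permits it."""
    parts = urllib.parse.urlsplit(url)
    rp = urllib.robotparser.RobotFileParser(
        f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(user_agent, url):
        return None  # honor the site's crawling policy
    # Respect the site's declared crawl delay, falling back to our own.
    time.sleep(rp.crawl_delay(user_agent) or default_delay)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()

# Hypothetical usage against an institutional repository landing page.
page = fetch_if_allowed("https://example.edu/etd/12345")
```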
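Likewise, the fine-tuning step behind ETDBERT can be illustrated with a short masked-language-modeling sketch, assuming the Hugging Face transformers and datasets libraries. The file name etd_corpus.txt and all hyperparameters are placeholders, not the actual ETDBERT configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from the original BERT checkpoint and adapt it to ETD text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# etd_corpus.txt (hypothetical): one passage of ETD full text per line.
dataset = load_dataset("text", data_files={"train": "etd_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard BERT pretraining objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="etdbert",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```

The resulting checkpoint could then be loaded with AutoModelForSequenceClassification and fine-tuned again for a downstream task such as subject domain classification.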