FOSSY 2025 | Presentation: Transliteration of Renaissance period Spanish Text.

Presented by

Shashank Shekhar Singh
@Shashankss1205
https://www.linkedin.com/in/shashank-shekhar-singh-121128254/

Shashank Shekhar Singh is a third-year undergraduate at IIT BHU in India with a strong interest in technology and open source software. He is a Machine Learning Intern at American Express and leads the CoPS-IG (Programming Group) at IIT BHU. A selected contributor for Google Summer of Code 2024 with HumanAI, he is also part of the LFX Mentorship Program under Linux Foundation Networking and contributes to libbitcoin through Summer of Bitcoin 2024. He previously worked as a Data Science Intern at Blue Yonder and took part in Inter IIT Tech Meet 13, collaborating with Adobe. As a core member of CoPS SDG, he promotes collaborative learning and open-source contributions. He has won seven hackathons and is committed to applying technology to real-world problems.

Abstract

Recent advances have made Artificial Intelligence/Machine Learning (AI/ML) processes increasingly relevant to business, healthcare, finance, retail, and telecommunications. The objective of the present study is to explore the potential of applying modern machine learning algorithms directly to humanities research. To that end, the renAIssance project was created to examine ways of using AI/ML algorithms for Optical Character Recognition (OCR) to speed up automated transcription of digitized historical archival documents and improve its accuracy. This article considers the state of the field as it pertains to the main processes used in natural language processing. It also explores difficulties arising from salient features of early modern printing practices that diverge from modern typographical conventions. Four AI/ML approaches were employed to achieve context-rich processing of a specifically selected historical archival dataset: Convolutional Neural Networks, Sequence-to-Sequence Contrastive Learning (SeqCLR), Vision Transformers, and Transformer-based OCR (TrOCR). The archival corpus consisted of 931 pages, 2,082 folios, and 61 manual transcriptions used as ground truth to train the algorithms. This study reports the accuracy achieved by each of the four methods when transcribing and transliterating early modern documents. Finally, it offers suggestions for future work applying AI/ML tools to the analysis of archival sources commonly handled by researchers in humanities fields such as literature and history.
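
The abstract does not say which metric is used to report accuracy. As an illustration only, the sketch below computes the character error rate (CER), a measure commonly used to compare OCR output against a manual transcription: the edit distance between the two strings divided by the length of the reference. The function names and the sample strings are hypothetical and are not taken from the renAIssance project.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up example: a long s in the source misread as "f" by the model.
    reference = "assi"   # hypothetical ground-truth transliteration
    hypothesis = "afsi"  # hypothetical model output
    print(character_error_rate(reference, hypothesis))
```

A lower CER means the model output is closer to the ground truth. The same distance applied to word lists instead of characters gives a word error rate.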