A collaboration between the Connecticut Digital Archive, Greenhouse Studios, and the Computer Science and Engineering department at the University of Connecticut, the Handwriting Text Recognition (HTR) project aims to create an open source text recognition model that can transcribe handwritten documents, unlocking the information contained in manuscripts held in archives around the world.
Archives and special collections contain multitudes of handwritten documents from across the centuries. The historical information in these documents is inaccessible because it is available only at the physical repository or only as images of manuscript pages. A reliable and accurate tool that deciphers these documents and makes their significant historical content readable by both humans and computers would expose the data they hold to researchers and students around the world. It would also open avenues of research on these materials that are currently closed, including text analysis and data mining. A tool of this nature would also save the significant time, effort, and money spent on hand-keyed transcription of these documents.
The goal of this project is to develop a foundation for large-scale, open source handwriting recognition software for historical documents by training a neural network to recognize the handwriting of 19th century scribes. We aim to reach this goal using a combination of machine learning and artificial intelligence techniques and software platforms, running on Graphics Processing Units (GPUs). To build a larger training set, we capture simple screenshots of individual characters from handwritten documents and slightly skew each captured image a number of times.
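The skewing step can be illustrated with a minimal sketch. It assumes that "skewing" means a small horizontal shear and that a character image is a grayscale grid stored as a list of rows; the project's actual transform, image format, and parameters are not stated here. The names `shear_image`, `augment`, `copies`, and `max_shear` are hypothetical and chosen only for illustration.

```python
import random


def shear_image(image, shear, fill=0):
    """Horizontally shear a grayscale image given as a list of rows.

    Each row is shifted sideways in proportion to its distance from the
    vertical center, which slants the character while keeping the canvas
    size unchanged. Pixels pushed off the canvas are dropped.
    """
    height = len(image)
    width = len(image[0])
    center = (height - 1) / 2.0
    sheared = []
    for y, row in enumerate(image):
        shift = int(round(shear * (y - center)))
        new_row = [fill] * width
        for x, value in enumerate(row):
            nx = x + shift
            if 0 <= nx < width:
                new_row[nx] = value
        sheared.append(new_row)
    return sheared


def augment(image, copies=5, max_shear=0.3, seed=0):
    """Return `copies` slightly skewed variants of one captured image.

    The shear range and the fixed seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [
        shear_image(image, rng.uniform(-max_shear, max_shear))
        for _ in range(copies)
    ]


# A 3x3 "character": a vertical stroke in the middle column.
stroke = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
variants = augment(stroke, copies=4)
```

Each captured screenshot yields several slanted variants with the same label, so the training set grows without additional manual capture. A real pipeline would more likely use an image library's affine transform with interpolation rather than this whole-pixel shift.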
The first stage of our work was completed in the summer of 2019 by an undergraduate computer science student, who created a training set of more than 16,000 images of 22 different characters (or classes), averaging about 260 per class, from 4 volumes of the John Adams Papers. Through a partnership with the Massachusetts Historical Society, we have access to over 200 pages of handwritten material from John Quincy Adams from the early 19th century. Obtained through the team’s previous engagement with this extensive collection of key historical documents, these pages have served as the initial data source for our work. We chose this data set because the handwriting in these documents is regular and because the documents are already transcribed, which lets us compare our automated output with the manually created transcripts.
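One common way to compare automated output against a manual transcript is the character error rate: the edit distance between the two strings divided by the length of the reference. The sketch below is an assumption about how such a comparison could be made, not a description of the project's own evaluation; `edit_distance` and `character_error_rate` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(predicted, reference):
    """Edit distance normalized by the length of the manual transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(predicted, reference) / len(reference)


# Hypothetical example: one wrong character in a five-character word.
rate = character_error_rate("adems", "adams")
```

A rate of 0 means the automated text matches the transcript exactly; higher values mean more corrections would be needed.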
In the summer of 2020, the project was awarded a LYRASIS Catalyst grant to continue developing our model. We determined that the best next step was to push the model further: scale up to 70+ character classes, increase the data set by 3-5x, and perform more tests. Building the larger data set required reviewing the remaining 3 volumes of handwritten documents from the John Adams Papers to identify and capture screenshots of characters. The goal was to capture more images of the 22 character classes from the original 2019 work and to identify new classes to further build up the training set for our model.
Progress and Findings
Materials and Links
Project GitHub (under construction)
Initial GitHub - https://github.com/mattlm0831/OCR-Handwriting
Image Capture Tool Code - https://github.com/bechardj/lc-tool
Image Capture Tool Site - https://lct.jbec.us/
Director, Digital Preservation Repository Program
Associate Professor In Residence, Associate Director of Undergraduate Programs in Computing
Associate Director, Greenhouse Studios