97 - Automated Analysis Of Historical Printed Documents, With Taylor Berg-Kirkpatrick

44 minutes

40.45 MB

Podcast

Podcaster

NLP Highlights

**The podcast is currently on hiatus. For more ac…

Science

Description

6 years ago

In this episode, we talk to Taylor Berg-Kirkpatrick about optical
character recognition (OCR) on historical documents. Taylor starts
off by describing some practical issues with older document-scanning
processes that make OCR on these documents a difficult problem.
Then he explains how one can build latent variable models for this
data using unsupervised methods, discusses the relative importance
of various modeling choices, and summarizes how well the models do.
We then take a higher-level view of historical OCR as a machine
learning problem, and discuss how it differs from other ML
problems in terms of the tradeoff between learning from data and
imposing constraints based on prior knowledge of the underlying
process. Finally, Taylor talks about the applications of this
research, and how these predictions can be of interest to
historians studying the original texts.