The availability of large digitized historical corpora presents myriad challenges and opportunities. This talk gives an overview of some of the key problems these corpora pose, including text preprocessing (correcting transcription errors, punctuating raw text, expanding abbreviations, and morphological tagging), authorship analysis, identification of cross-references and parallel texts, and stemma and ur-text reconstruction. We will focus in depth on two especially interesting challenges: source analysis of multi-author documents (including Hebrew biblical books) and ur-text reconstruction from multiple noisy textual witnesses.
Moshe Koppel is a professor of computer science at Bar-Ilan University in Israel and chief scientist of DICTA, which aims to apply methods of computational linguistics to a large historical corpus of Hebrew and Jewish literature. Much of his research has focused on text-related applications of machine learning, especially authorship attribution. He has published academic papers in leading journals in computer science, mathematics, linguistics, economics, law, political science, and other disciplines.