Computerized Tools for Navigating Tibetan Literary Corpora
5 March 2014

Photo: Orna Almogi
March 5, 2014, Beit Maimon, Tzahal St. 4, Zichron Yaakov, Israel
With increasing access to old Tibetan manuscripts and xylographs in recent decades, numerous new research opportunities have opened up to scholars of Tibetan textual studies. The opportunities offered by these new discoveries and developments, however, also bring new challenges, particularly with respect to navigating effectively through the enormous amount of material. Moreover, while until recently scholarly attention has focused mainly on the content of this material, the focus is now increasingly shifting to the codicological, paleographical, and material aspects of these documents. The present workshop thus aims to introduce and discuss several computerized tools that could aid navigation through literary corpora in general, and Tibetan corpora in particular. While some of the tools offer navigation solutions for textual material extant in transcribed or transliterated electronic form, others attempt to do so for material available only in the form of scanned images.
Programme
Nachum Dershowitz
Approximate Text Alignment and Machine-Aided Lexicography
The growing amount of transcribed and transliterated Tibetan texts enables scholars to search the contents of textual material on a scale greater than ever before. It also opens up possibilities of examining and comparing texts in more sophisticated ways, which would allow one to study various other aspects of textual material. These include locating shared passages (i.e. quotations or “borrowed” textual material) in two or more texts and tracing the evolution of texts. I shall present preliminary results of our efforts to develop methods of locating shared text passages within Tibetan Buddhist corpora transcribed in Latin script. In this exploratory work, we compared two major Buddhist texts in their Tibetan translation, the Sūtrasamuccaya and the Śikṣāsamuccaya. We borrowed an algorithm designed for finding all (sufficiently long) approximate subsequence matches in genomic data and adapted it to finding common approximate subtexts between two large corpora. This involved parallelizing the algorithm and adding some simple preprocessing and some less trivial post-processing. In addition, I propose to discuss several ways in which computers can aid in the construction of lexicons and ontologies. In particular, alignment algorithms can be used to match words in Tibetan with corresponding words or phrases in a parallel Sanskrit text.
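The idea of an approximate shared passage can be illustrated with a small sketch. The abstract does not name the genomic algorithm that was adapted, so the code below substitutes classic Smith-Waterman local alignment, applied to syllable tokens rather than nucleotides. The function name `local_align`, the scoring values, and the two toy token lists are assumptions made for illustration; the tokens are not quoted from the Sūtrasamuccaya or the Śikṣāsamuccaya.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment over two token lists.

    Returns (score, (a_start, a_end), (b_start, b_end)); end indices are
    exclusive. Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),  # align the two tokens
                          H[i - 1][j] + gap,            # token only in a
                          H[i][j - 1] + gap)            # token only in b
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])


# Toy input: two token lists sharing a passage, with one extra syllable in the second.
text_a = "x1 x2 sangs rgyas kyi chos la dad pa y1".split()
text_b = "z1 sangs rgyas kyi dam chos la dad pa z2 z3".split()

score, a_span, b_span = local_align(text_a, text_b)
print(" ".join(text_a[a_span[0]:a_span[1]]))
print(" ".join(text_b[b_span[0]:b_span[1]]))
```

The two printed spans are the passages judged to correspond, with the inserted syllable absorbed as a gap. This quadratic-time version reports only the single best match; finding all sufficiently long matches across large corpora, as described above, would need additional machinery.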
Lior Wolf
Automatic Scribal Analysis of Old Manuscripts
The increasing amount of scanned images of Tibetan manuscripts and xylographs, on the one hand, and the growing interest in paleographical and codicological aspects on the part of scholars of Tibetan Studies, on the other, have brought about the need to explore new methods for examining the huge bulk of material, with the aim of proposing a scheme for systematized Tibetan paleography and codicology. I shall describe our plans to apply to Tibetan manuscripts the same methods that have been successful in our recent work with the Cairo Genizah. (The Genizah is a collection of handwritten documents containing some 350,000 fragments discovered in Cairo in the late 19th century, which are today spread out in more than seventy collections worldwide.) Using computer-vision and machine-learning algorithms, we have been able to automatically classify Genizah manuscripts by script style and to identify hundreds of new “joins”, that is, matches between leaves in the same hand and originally part of the same manuscript, but now catalogued separately. I shall present initial experiments that were conducted with the first 30 volumes of the bKa’ gdams gsung ’bum collection and show that the same software is able to accurately match manuscripts that were written in the same script and script subtype.
Tal Hassner
Aligning Transcripts and Images without Performing OCR
Recent large-scale digitization and preservation efforts have made images of original manuscripts, accompanied by transcripts, commonly available. An important challenge, for which no practical system exists, is that of aligning transcript letters to their coordinates in manuscript images. We propose a system that directly matches the image of a historical text with a synthetic image created from the transcript for that purpose, rather than attempting to recognize individual letters in the manuscript image using optical character recognition (OCR). Our method matches the pixels of the two images by employing a dedicated dense-flow mechanism coupled with novel local image descriptors designed to spatially integrate local patch similarities. Matching these pixel representations is performed using a message-passing algorithm. The various stages of our method make it robust with respect to document degradation, to variations between script styles, and to non-linear image transformations. The robustness and practicality of the system are verified by comprehensive empirical experiments.
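To give a feel for flow estimation by message passing, here is a one-dimensional toy sketch. It is not the system described above: the real method works on two-dimensional images with purpose-built local descriptors, whereas this sketch compares raw values along a single row and runs exact min-sum message passing on a chain. The function name `chain_flow`, the parameters `max_shift` and `smooth`, and the sample signals are all assumptions made for illustration.

```python
def chain_flow(src, dst, max_shift=3, smooth=1.0):
    """Assign each index x of src a shift d so that src[x] resembles dst[x + d].

    Minimises sum |src[x] - dst[x + d_x]| + smooth * |d_x - d_(x-1)| exactly,
    by min-sum message passing along the chain (a toy 1D stand-in for dense flow).
    """
    shifts = list(range(-max_shift, max_shift + 1))
    INF = float("inf")

    def data_cost(x, d):
        y = x + d
        # Shifts that point outside dst are forbidden.
        return abs(src[x] - dst[y]) if 0 <= y < len(dst) else INF

    # Forward pass: msg[k] is the cheapest cost of src[0..x] ending with shifts[k].
    msg = [data_cost(0, d) for d in shifts]
    back = []  # back[x-1][k]: best predecessor label for label k at position x
    for x in range(1, len(src)):
        new_msg, choices = [], []
        for d in shifts:
            prev = min(range(len(shifts)),
                       key=lambda k: msg[k] + smooth * abs(shifts[k] - d))
            new_msg.append(msg[prev] + smooth * abs(shifts[prev] - d) + data_cost(x, d))
            choices.append(prev)
        msg = new_msg
        back.append(choices)

    # Backward pass: follow the stored predecessors from the cheapest final label.
    k = min(range(len(shifts)), key=lambda i: msg[i])
    flow = [shifts[k]]
    for choices in reversed(back):
        k = choices[k]
        flow.append(shifts[k])
    flow.reverse()
    return flow


# Toy signals: a "synthetic" row and a "manuscript" row in which the two
# peaks are displaced by different amounts (a non-linear stretch).
synthetic = [0, 5, 0, 0, 9, 0]
manuscript = [0, 0, 5, 0, 0, 0, 9, 0]
flow = chain_flow(synthetic, manuscript)
print(flow)
```

In the result, the first peak is mapped with a shift of 1 and the second with a shift of 2, with the smoothness term keeping neighbouring shifts close. The exact position where the shift changes between the two peaks is not determined by the data, since the intervening values are identical.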