Abstracts 2023/24


7.11.: Jochen Büttner (Berlin, Halle): Generative AI: Taking Stock

At the latest with the release of GPT-4 in March of this year, the disruptive character of generative AI has become apparent to non-experts as well. This contribution sketches, in broad strokes, the steps that made this new generation of generative AI possible, conveys a basic understanding of how it works, and discusses whether and to what extent these models already show signs of artificial general intelligence. It also reflects on how generative AI, in its current state, can be used to support research in the humanities, which problems and limits stand in the way of its use in research, and which further developments are emerging.

21.11.: Martin Langner (Göttingen): Measurements, Patterns, Models: Computer-Assisted Image Analysis in Classical Archaeology

While specific features of works of ancient art have long been measured and compared, deep learning methods for image pattern recognition have increasingly been applied in archaeology in recent years. The real breakthrough in image analysis, however, only seems to become possible with the current vision-language models. Using research on portraits, terracottas, and vases as examples, the talk presents and discusses possible applications of computer vision to 2D and 3D data.
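To make the kind of vision-language approach mentioned here concrete, the following sketch shows zero-shot labelling of an artefact photograph with a publicly available CLIP model via Hugging Face transformers. This is an illustration only, not the speaker's pipeline; the model name, image path, and candidate labels are placeholder assumptions.

```python
# Illustrative only: zero-shot labelling of an artefact photograph with a
# publicly available vision-language model (CLIP). Model name, image path,
# and candidate labels are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("vase_fragment.jpg")  # hypothetical input image
labels = ["red-figure vase", "black-figure vase", "terracotta figurine", "marble portrait"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)[0]

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```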

5.12.: C. Antonetti, E. Paganoni (Venice/Münster): Preserving Inscriptions on Paper: Old and New Challenges. The Experience of the Venice Squeeze Project

This paper presents the results of the “Venice Squeeze Project” (VSP), directed by Claudia Antonetti and based at Ca’ Foscari University of Venice. The VSP promotes initiatives to preserve and enhance collections of epigraphic squeezes. After publishing the Ca’ Foscari collection online, it now supports other projects that aim to digitize archives of epigraphic squeezes.

19.12.: G.R. Smidt, E. Lefever, K. De Graef, K.K.T. Chandrasekar, L. Foket (Ghent University): MIND THE GAP! Advancing Cuneiform Studies through Digital Collaboration

The cuneiform corpus is rich in material and in challenges for digital projects. With more than 500,000 cuneiform texts excavated to date, the size of the corpus makes it a cornerstone among extant ancient written sources. As a corpus, it is rife with data that benefits not only researchers of cuneiform cultures but also fields such as linguistics, economics, and philosophy. The objects themselves pose interesting challenges as well: they are written in a tradition that can be traced back to the first texts ever produced, and they are mainly impressed into clay as a three-dimensional script. Working with this corpus in the 21st century CE requires approaches that are in some cases still underdeveloped for our field. Open access to the corpus is essential, as the objects have been scattered across collections all over the world, often with little regard for the coherence of text assemblages or their place of origin. Current advances in machine learning have advanced the state of the art in ancient language processing. Automatic recognition of signs can speed up the reading of the immense number of tablets and potentially mitigate issues stemming from difficult-to-read signs. Language models will help us grasp ancient languages better: we will be able to quantify observations and contextualise close readings.
When working to help develop such solutions, we recognise that cooperation between cuneiformists and digital specialists is paramount. The CUNE-IIIF-ORM project is centred around cooperation. Our goals are to disseminate, increase and augment the corpus of Old Babylonian (c. 2000-1600 BCE) Akkadian texts. Old Babylonian texts from the Royal Museums of Art and History will be digitized, annotated with metadata and textual data, and linked with relevant texts. We utilise the International Image Interoperability Framework (IIIF) to create manifests that can be exhibited online in formats suited to the user, and that can be freely accessed and redistributed. To increase the number of texts in the corpus of Old Babylonian Akkadian texts, we work with high-quality annotations of 2D+ images to create an Optical Character Recognition (OCR) model. The goal is to create a pipeline that will assist in producing digitized textual publications. With Natural Language Processing (NLP) we will develop the tools needed to semi-automatically annotate Old Babylonian Akkadian texts and later query them, which can provide us with both a deeper and more nuanced knowledge of Akkadian.
During this talk we will introduce the project’s goals and account for how the three elements (IIIF, OCR and NLP) are intertwined. Furthermore, we will delve into each of the three elements separately, but focus mainly on NLP, as the textual content is ultimately the message that has been carried across up to 5,000 years.
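For readers unfamiliar with IIIF, the following sketch shows what a minimal Presentation API 3.0 manifest for a single tablet photograph could look like, assembled here as a plain Python dictionary. All identifiers, labels, and dimensions are placeholders and do not reflect actual CUNE-IIIF-ORM data.

```python
# A minimal IIIF Presentation API 3.0 manifest for one tablet photograph.
# All URIs, labels, and dimensions are placeholders, not project data.
import json

BASE = "https://example.org/iiif/tablet-001"  # hypothetical base URI

manifest = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": f"{BASE}/manifest.json",
    "type": "Manifest",
    "label": {"en": ["Old Babylonian tablet (placeholder record)"]},
    "items": [
        {
            "id": f"{BASE}/canvas/1",
            "type": "Canvas",
            "height": 3000,
            "width": 2000,
            "items": [
                {
                    "id": f"{BASE}/canvas/1/page/1",
                    "type": "AnnotationPage",
                    "items": [
                        {
                            "id": f"{BASE}/canvas/1/page/1/anno/1",
                            "type": "Annotation",
                            "motivation": "painting",
                            "body": {
                                "id": f"{BASE}/full/max/0/default.jpg",
                                "type": "Image",
                                "format": "image/jpeg",
                                "height": 3000,
                                "width": 2000,
                            },
                            "target": f"{BASE}/canvas/1",
                        }
                    ],
                }
            ],
        }
    ],
}

print(json.dumps(manifest, indent=2))
```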

16.01.: Patrick Burns (NYU): How to Read Latin like a Computer: A case study of Latin noun chunking with spaCy

What are the strategies that people use to read Latin? What are the strategies that computer models use to “read”—that is, to process—Latin? And what can we learn about one from the other? This talk builds on my experience of reading Dexter Hoyos’s 2016 book Latin: How to Read it Fluently while training Latin language models for use with the natural language processing platform spaCy (i.e. LatinCy). Hoyos recommends a practice of reading that works left-to-right—from the first word of a sentence to its last—proceeding toward incremental and increasing awareness of the syntactic and semantic structures of the text. In particular, he shows students how they can mentally group related words into substructural units, illustrating the approach with sentence diagrams that delineate relationships between words and phrases. As I worked on LatinCy, I was struck by the (perhaps surprising) correspondence between Hoyos’s description of the Latin reading process and the patterns of computational “reading” that arise in NLP model development. Just as Hoyos’s fluent Latin reader moves through the text making provisional guesses about the syntactic role of each word in context and revising these guesses as more information becomes available, so too do NLP pipelines, both at training time and when deployed: text is fed into pipelines where it is processed sequentially by components such as tokenizers, part-of-speech taggers, lemmatizers, and so on, in an effort to describe the text’s syntax and basic semantics. In this talk, I offer a comparison of human and computational reading processes through the specific case study of noun chunking. In order for the pipeline to properly return, for example, the noun chunk res populi Romani from Livy’s preface (i.e. Facturusne operae pretium sim si a primordio urbis res populi Romani perscripserim etc.), at a minimum the dependency parser needs to recognize the agreement of noun and adjective (populi Romani) and the relation of the genitive to its head noun (res populi), which in turn requires that the POS tagger and the morphological tagger have performed properly as well (to say nothing of the tokenizer and lemmatizer). As I argue here, the interaction of these components reflects similar reading strategies found in the Latin pedagogical literature (including Hoyos, but also Waldo Sweet, Daniel McCaffrey, and Jacqueline Carlon, among others). By way of conclusion, I engage with Christopher Forstall and Walter Scheirer’s computational theory of the Latin reading mind (in 2019’s Quantitative Intertextuality) and offer some speculative remarks on where the future of Latin study may go when NLP pipelines like LatinCy are complemented and supplemented by large language models and other types of artificial textual intelligence.
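As an illustration of the pipeline interaction described above, the following sketch inspects the dependency analysis behind a chunk such as res populi Romani. It assumes a LatinCy model (here la_core_web_lg) has been installed; the model name and the availability of a noun-chunk iterator are assumptions made for the example, not guaranteed by this abstract.

```python
# Sketch: examining the parse behind a Latin noun chunk with spaCy,
# assuming the LatinCy model "la_core_web_lg" is installed.
import spacy

nlp = spacy.load("la_core_web_lg")  # assumed LatinCy pipeline name
doc = nlp("Facturusne operae pretium sim si a primordio urbis res populi Romani perscripserim")

# Token-level view: the parser must link Romani -> populi (agreement) and
# populi -> res (genitive dependent) for the chunk to be recoverable.
for token in doc:
    print(f"{token.text:15} {token.pos_:6} {token.dep_:10} head={token.head.text}")

# If the pipeline registers a syntax iterator for Latin, noun chunks can be
# read off directly; otherwise spaCy raises an error, which we tolerate here.
try:
    for chunk in doc.noun_chunks:
        print(chunk.text)
except (NotImplementedError, ValueError):
    pass  # no noun-chunk iterator or no dependency parse available
```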

16.01./17.01.: Patrick Burns (NYU): NLP Workshop: Training NLP pipelines for historical languages: Some considerations

What are the basic computational resources necessary for training end-to-end NLP pipelines, that is, pipelines that include, at a minimum, tokenization, part-of-speech and morphological tagging, lemmatization, and dependency parsing? And what special considerations do we need to take into account when developing such resources for historical languages? In this workshop, we will review available resources for training pipelines, such as Universal Dependencies treebanks, Wikipedia and Common Crawl text collections, and fastText vectors, among others, with an eye toward the challenges faced when working with (often lower-resourced) historical languages. The workshop will use the example of adapting the project workflow for LatinCy, an end-to-end spaCy pipeline for Latin, to a comparable workflow for Ancient Greek. While these two languages (Latin and Ancient Greek) will be used for the workshop examples, participants will be encouraged to think through issues of resource availability, compatibility, size, and coverage for any historical language applicable to their own research interests. The workshop will be technical in nature, in particular with respect to defining spaCy project workflows and configuration files, though the goal is to keep the discussion accessible to digital humanists at all levels.
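As a rough illustration of the component layout such a pipeline would have (not the workshop's actual materials), the sketch below assembles a blank spaCy pipeline for Ancient Greek. It assumes spaCy ships a base "grc" tokenizer; in practice each component would still need to be trained, e.g. via spacy train with a config file and Universal Dependencies treebank data.

```python
# Sketch: the skeleton of an end-to-end pipeline (tokenizer plus trainable
# components) for Ancient Greek, assuming spaCy provides a base "grc" language.
# The components are untrained here; training happens via `spacy train`.
import spacy

nlp = spacy.blank("grc")
for name in ("tagger", "morphologizer", "trainable_lemmatizer", "parser"):
    nlp.add_pipe(name)

print(nlp.pipe_names)  # ['tagger', 'morphologizer', 'trainable_lemmatizer', 'parser']
```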

06.02.: Elisa Roßberger (FU Berlin): Can We Speed This Up? Opportunities and Challenges in the Semi-Automated Tagging of Seal Imagery from Ancient Western Asia

The goal of the BMBF eHeritage project “Annotated Corpus of Ancient West Asian Imagery: Cylinder Seals” (ACAWAI-CS, 2020-2023) was to provide detailed keyword annotation of the imagery on roughly 20,000 cylinder seals and seal impressions from ancient Western Asia, primarily from Iraq and Syria and dating from the 3rd to the 1st millennium BCE. To speed up the process and to standardize the results as far as possible, an AI-supported annotation platform was adopted after initial in-house experiments with artificial neural networks. I will discuss how helpful object detection and automated labeling were for the project, as well as the question of which methods should sensibly play a role in the future digital documentation of Western Asian seals. The concrete implementation of Linked Open Data (LOD) and FAIR data principles has also confronted the project with challenges over its three-year duration, which I would like to discuss in this Berlin Digital Classicist Seminar talk.
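As a generic illustration of what object detection on seal imagery can involve (not the project's actual annotation platform), the sketch below runs a pretrained torchvision detector over a photograph to propose bounding boxes that a human annotator could then confirm or relabel. The image path and confidence threshold are placeholders, and a general-purpose pretrained model would of course need fine-tuning on seal motifs to be useful.

```python
# Generic object-detection sketch (not the ACAWAI-CS platform): a pretrained
# torchvision detector proposes bounding boxes on a seal-impression photograph
# for a human annotator to confirm or relabel. Path and threshold are placeholders.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("seal_impression.jpg").convert("RGB")  # hypothetical input
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    prediction = model([tensor])[0]  # dict with "boxes", "labels", "scores"

for box, score in zip(prediction["boxes"], prediction["scores"]):
    if score >= 0.5:  # placeholder confidence threshold
        print([round(v, 1) for v in box.tolist()], float(score))
```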