Digital Classicist Berlin == Cataloguing Open Access Classics Serials

Cataloguing Open Access Classics Serials

Posted on 22 January 2018

Talk: Simona Stoyanova & Gabriel Bodard (Institute of Classical Studies, University of London), “Cataloguing Open Access Classics Serials”.

Permalink: http://hdl.handle.net/21.11117/0000-0000-60EA-C

Date: Monday, 22 January 2018

Time: starting at 17:00 c.t. (i.e. 17:15)

Venue: Humboldt-Universität zu Berlin, Gebäude Hausvogteiplatz (Raum 0319). Address: Hausvogteipl. 5-7, 10117 Berlin

Abstract

The catalogues of most research libraries offer access to the electronic journals to which they subscribe, but rarely include the thousands of open access journals that are also of interest to their readers. This hides many of these publications from scholarly view and slows the acceptance and adoption of open access publishing. The Cataloguing Open Access Classics Serials project harvests bibliographic metadata for ca. 1500 open access journals (containing a little over 50,000 articles) in Classics and Ancient History, found on various sites that list or index open access publications, starting from the Ancient World Online (AWOL) index. This is comparable to the number of classics journals included in a silo such as JSTOR, with the added value that they are not only free but usually licensed for redistribution or reuse in some sense. Since the catalogue information for these journals is available in various formats (JSON, HTML, MARC), we transform it into a standard library metadata format (MARC). As a first test, we then ingest the records into the Classics library catalogue of our institution, providing users with access from the search interface alongside print and subscription titles. A number of the journals expose their metadata in standard formats (OAI-PMH, MARCXML, Dublin Core) that can easily be transformed into the format best suited to a library catalogue. A significant number of them offer only basic bibliographic information in unstructured HTML, for which a different harvesting strategy is needed: strip the content from the HTML, structure it as JSON and transform it into MARC. A certain amount of manual curation and trial-and-error-based iteration is involved in the harvesting process. By the end of the project we will have added around 1500 new records to our Classics library catalogue, contributing to the research practice of scholars and other library users, and enhancing the visibility and accessibility of open access materials.
The bundle of JSON, MARCXML and MARC records will also be made available for other libraries to ingest, and to facilitate reuse and further development of the data outside the constraints of library catalogues. All code for metadata harvesting and transformation (Python, using the JSON encoder and decoder together with the Beautiful Soup and pymarc libraries) will be published in an open source repository. We consider this project to be a pilot for future work, which might include access to the raw text of open access and openly licensed scholarship for text mining, entity extraction and other linguistic and computational approaches.
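
The HTML harvesting route described in the abstract (strip the content from the HTML, structure it as JSON, transform it into MARC) can be illustrated with a minimal sketch. It is not the project's code: to stay dependency-free it uses only the Python standard library in place of Beautiful Soup and pymarc, and it emits a MARCXML record rather than binary MARC. The sample HTML, its class names, the example URL and ISSN, the helper names and the choice of fields (022 for ISSN, 245 for title, 856 for URL) are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace

# Hypothetical listing entry; real pages vary widely in their markup.
SAMPLE_HTML = """
<div class="journal">
  <a href="https://example.org/journal">Example Journal of Classics</a>
  <span class="issn">1234-5678</span>
</div>
"""


class JournalParser(HTMLParser):
    """Collect the first link (title and URL) and the text of the ISSN span."""

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "url": "", "issn": ""}
        self._target = None  # key currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and not self.record["url"]:
            self.record["url"] = attrs.get("href", "")
            self._target = "title"
        elif tag == "span" and attrs.get("class") == "issn":
            self._target = "issn"

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._target = None

    def handle_data(self, data):
        if self._target:
            self.record[self._target] += data.strip()


def html_to_dict(html):
    """Step 1: strip the bibliographic content out of the HTML."""
    parser = JournalParser()
    parser.feed(html)
    parser.close()
    return parser.record


def dict_to_marcxml(rec):
    """Step 3: build a minimal MARCXML record from the structured data."""
    ET.register_namespace("", MARC_NS)
    record = ET.Element(f"{{{MARC_NS}}}record")
    leader = ET.SubElement(record, f"{{{MARC_NS}}}leader")
    leader.text = "00000nas a2200000 a 4500"  # placeholder serial leader
    for tag, ind1, ind2, code, key in (
        ("022", " ", " ", "a", "issn"),
        ("245", "0", "0", "a", "title"),
        ("856", "4", "0", "u", "url"),
    ):
        if not rec.get(key):
            continue  # skip fields the page did not provide
        field = ET.SubElement(
            record,
            f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": ind1, "ind2": ind2},
        )
        sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        sub.text = rec[key]
    return ET.tostring(record, encoding="unicode")


record = html_to_dict(SAMPLE_HTML)
as_json = json.dumps(record, ensure_ascii=False, indent=2)  # step 2: JSON
marcxml = dict_to_marcxml(json.loads(as_json))
print(as_json)
print(marcxml)
```

Keeping JSON as an explicit intermediate step mirrors the abstract's description and gives a natural point for manual curation before any MARC is generated. A real harvester would need a parser per source site and a fuller set of MARC fields.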

Literature

S. Lee and E. K. Jacob, “An Integrated Approach to Metadata Interoperability: Construction of a Conceptual Structure between MARC and FRBR”, Library Resources & Technical Services, Vol 55, No 1 (2011). Available at: https://journals.ala.org/index.php/lrts/article/view/5539/6818

S. Bird, E. Klein and E. Loper, Natural Language Processing with Python, O’Reilly, 2009. Available at: http://www.nltk.org/book/. See Ch. 1, “Language Processing and Python”, sections 1 & 2 (http://www.nltk.org/book/ch01.html)