This case study discusses the key decisions in adopting standards and technologies for a digitization project, in dialogue with ongoing scholarship around minimal computing and minimal editions, with specific focus on choices that affect long-term preservation and access.
by Raffaele Viglianti, Research Programmer, Maryland Institute for Technology in the Humanities at University of Maryland
The Shelley-Godwin Archive (S-GA) aims to provide the digitized manuscripts of Percy Bysshe Shelley, Mary Wollstonecraft Shelley, William Godwin, and Mary Wollstonecraft, bringing together online the widely dispersed handwritten legacy of this uniquely gifted family of writers. The archive, started in 2013, has published a number of manuscripts, including those containing the draft of Mary Shelley’s Frankenstein, the fair copy of Percy Shelley’s Prometheus Unbound, and William Godwin’s Political Justice. Both Frankenstein and Prometheus Unbound also include transcriptions of the manuscripts that attempt to faithfully reproduce the writing, acting as a “map” for reading and comprehending the manuscript page. While the holdings of S-GA continue to grow, the technological stack at the core of this online publication is fairly stable at this point in time. In this case study, we discuss key decisions in adopting standards and technologies for the project, in dialogue with ongoing scholarship around minimal computing and minimal editions. In particular, we will describe those choices that affected the chances for the project’s long-term preservation and access. Finally, the case study focuses on exploratory efforts in enabling offline use of the archive in order to increase its availability to a larger number of communities with variable access to the Internet.
The technologies behind S-GA: a short overview
S-GA primarily deals with two kinds of data: the facsimile images of the manuscripts and, when available, their transcriptions. Images are handled through a specification published by the International Image Interoperability Framework (IIIF, pronounced “triple-eye-eff”) Consortium. This specification provides a standardized way of serving and requesting images or specific regions of images according to various parameters such as size, resolution, or rotation. IIIF makes it possible for digital libraries to host and serve images and for projects such as S-GA to request and embed them in their applications. The specification is quickly being adopted by major libraries across the globe. The Oxford Bodleian Library holds a number of manuscripts of interest to S-GA, many of which have been digitized with funding from the archive. The Bodleian Library is now serving these images with IIIF, which makes it possible for S-GA to embed them directly from the library, without needing to host them as part of the project infrastructure. This reduces the infrastructural footprint of the project, which we argue increases its chances for future preservation.
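As a rough illustration of what "requesting specific regions of images according to various parameters" looks like in practice, the sketch below builds request URLs following the IIIF Image API path pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The server address and image identifier are hypothetical placeholders, not the Bodleian's or S-GA's actual endpoints, and the `max` size keyword assumes a server supporting a recent version of the Image API (older servers use `full`).

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path segments."""
    # The identifier is percent-encoded, including any slashes it contains.
    encoded_id = quote(identifier, safe="")
    return (f"{base.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, for illustration only.
BASE = "https://iiif.example.org/image"

# The whole page at the largest size the server offers.
full_page = iiif_image_url(BASE, "ms-abinger-c57/0001")

# A 1000x800 pixel region from the top-left corner,
# scaled to 500 pixels wide and rotated by 90 degrees.
detail = iiif_image_url(BASE, "ms-abinger-c57/0001",
                        region="0,0,1000,800", size="500,", rotation="90")

print(full_page)
print(detail)
```

Because every parameter lives in the URL, a project can embed a page image or a zoomed detail by linking to the holding library's server, without storing any derivative of its own.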
The transcriptions are encoded according to the Text Encoding Initiative (TEI) Guidelines. The TEI, expressed using the Extensible Markup Language (XML) technology, is a long-standing standard in the digital humanities and is used globally for text encoding in academic projects. S-GA uses a particular subset of the TEI, recently added to the standard, that focuses on identifying and describing writing areas, their layout on the page, and authorial revisions (such as additions and deletions). Texts encoded in TEI are typically published on the web, which makes it possible to create interactive reading experiences. TEI’s approach also enables the same encoded content to be delivered in other formats and media, such as e-book and PDF for print. Web-based interactive digital editions, however, are the most efficient in leveraging TEI’s ability to formalize, in a machine-readable way, scholarship as well as text. In S-GA’s case, for example, “scholarship” encoded in the archive’s TEI includes (but is not limited to) the identification and description of text written in different “hands” (different people or writing media) and the sequence of textual revisions.
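To make the idea of machine-readable revisions more concrete, here is a minimal sketch that reads a small TEI fragment and derives a "reading text" by dropping deletions and keeping additions. The fragment is made up for illustration and is not taken from the archive; the element names (`surface`, `zone`, `line`, `add`, `del`) and the `hand` attribute are standard TEI, while the hand identifier `#pbs` and the helper `reading_text` are assumptions of this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Made-up fragment for illustration; not an actual S-GA transcription.
SAMPLE = f"""<surface xmlns="{TEI_NS}">
  <zone type="main">
    <line>It was on a <del>dismal</del><add hand="#pbs">dreary</add> night</line>
  </zone>
</surface>"""


def reading_text(elem):
    """Return the text of elem with deletions dropped and additions kept."""
    if elem.tag == f"{{{TEI_NS}}}del":
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(reading_text(child))
        # Text following a child element belongs to the parent's content.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
lines = root.findall(f".//{{{TEI_NS}}}line")
for line in lines:
    print(reading_text(line))

# Collect each addition together with the hand it is attributed to.
additions = [(a.get("hand"), a.text) for a in root.iter(f"{{{TEI_NS}}}add")]
print(additions)
```

The same encoded line could just as well be rendered with the deletion struck through, or filtered by hand, which is the kind of interactivity that a web-based edition can build on top of the encoding.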
Ongoing scholarship around minimal computing and minimal editions has pointed out some important, yet addressable, flaws of many TEI-powered digital editions. Bloatedness of infrastructure is one of them. Particularly when paired with rapid technical obsolescence and changes in funding, it can hamper long-term preservation efforts. Weighty resources may also not be easily accessible from slower connections, and online-only access to a digital edition can be an obstacle to worldwide access. This bloatedness is often induced by the necessity to transform TEI into HTML to be displayed in web browsers. Because this is a lossy transformation, TEI is often handled through an XML database or other server component able to query and transform it on the fly into the HTML required by a specific view on the text. This may, indeed, often be the best approach depending on the goals of a project and its funding for the present and the future. However, S-GA intentionally experimented with reducing this dependence on transforming TEI into HTML.
Experimenting with offline access
By offline access we mean being able to download and keep a copy of part of the archive, such as a specific manuscript or work, and being able to view it offline in the browser with the look and features of the site largely unchanged. We suggested earlier that TEI is best published on the web because of the interactivity that the medium enables; therefore, it was a priority to maintain this interactivity in the offline version. The images, transcriptions, and other creative aspects of the site are all openly licensed, which makes it possible for individuals to download, keep, and modify their copy. We tried three different approaches, each with its own advantages and disadvantages.
One document does it all: HTML
- Pros: the resulting HTML file can simply be “opened” in a browser, including mobile browsers.
- Cons: Images cannot be full size: a base64 representation is roughly a third larger than the original binary (about 33% unwrapped, approaching 37% with MIME-style line wrapping), so we opted to halve their size. Moreover, because they are bundled in the HTML, individual resources are not easily accessible outside of the website.
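The size cost of bundling images into a single HTML file can be checked directly. Base64 turns every 3 bytes of input into 4 characters of output, so an unwrapped encoding grows by about one third, and somewhat more when lines are wrapped MIME-style. The sketch below encodes stand-in image bytes as a data URI and measures the growth; the helper name `as_data_uri` and the random bytes standing in for a real facsimile are assumptions of this example, not the archive's actual build code.

```python
import base64
import os


def as_data_uri(image_bytes, mime="image/jpeg"):
    """Encode raw image bytes as a data URI usable in an <img> src."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Random bytes stand in for a real image file of 300,000 bytes.
fake_image = os.urandom(300_000)

uri = as_data_uri(fake_image)
payload = uri.split(",", 1)[1]
overhead = len(payload) / len(fake_image) - 1
print(f"base64 payload is {overhead:.1%} larger than the binary")
# prints: base64 payload is 33.3% larger than the binary

# The data URI can then be written straight into the page markup.
img_tag = f'<img alt="page facsimile" src="{uri}">'
```

For a manuscript with hundreds of page images this overhead adds up quickly, which is why reducing image dimensions becomes attractive in a single-file bundle.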
This same approach to mapping resources could theoretically be employed to store the site in the browser’s “local store”. This would allow users to visit the site again when offline, but they would not “own” the data, so to speak, to re-use it or otherwise manipulate it.
Zip it up
- Pros: The images can be downloaded via IIIF protocols when creating the archive and can be full size. It is possible to include multiple pages such as the table of contents and contextual pages. Moreover, the user gets easy access to all the resources outside of the website.
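A minimal sketch of how such a package might be assembled is given below, using Python's standard `zipfile` module. The file layout, file names, and placeholder contents are hypothetical, and in a real build the image bytes would first be fetched from the IIIF server rather than written inline.

```python
import io
import zipfile


def build_offline_zip(files):
    """Pack a mapping of archive paths to bytes into an in-memory Zip file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path, data in files.items():
            # Images are already compressed, so store them as-is;
            # deflate text resources such as HTML and TEI.
            is_image = path.endswith((".jpg", ".png"))
            method = zipfile.ZIP_STORED if is_image else zipfile.ZIP_DEFLATED
            archive.writestr(path, data, compress_type=method)
    return buffer.getvalue()


# Hypothetical layout; names and contents are placeholders.
package = build_offline_zip({
    "index.html": b"<!DOCTYPE html><title>Table of contents</title>",
    "tei/page-0001.xml": b"<surface xmlns='http://www.tei-c.org/ns/1.0'/>",
    "images/page-0001.jpg": b"\xff\xd8\xff placeholder bytes",
})

# Reopen the package to confirm every resource is individually accessible.
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    names = archive.namelist()
    index_html = archive.read("index.html")
print(names)
```

Keeping each page, transcription, and image as a separate entry is what gives the user direct access to the individual resources once the archive is unpacked.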
Electron is an open-source framework for building cross-platform desktop applications using web technologies. In this experiment, we use it to wrap the same content needed for the Zip archive version and compile it as applications for Windows, Mac, and Linux.
- Pros: Similarly to the Zip archive, images and other resources can be included in full. The applications are installed on the user’s computer and can be customized to be given a more official look.
- Cons: There is currently no support for mobile devices and no easy way to access resources outside of the application. Finally, the resulting applications tend to be markedly larger than the site by itself.
Accessible and offline futures for TEI projects
The “one document does it all” approach of bundling everything in one HTML file is the most versatile and mobile-friendly, the latter advantage being critical for enabling more access in the Global South. Unfortunately, this approach limits the size of the images that can be included. This may nonetheless be a very desirable option for digital editions that do not include facsimile images, but it is problematic for the Shelley-Godwin Archive. High-quality images, particularly when prepared to be explored at different zoom levels, take a lot of space. For example, one of the Frankenstein manuscripts, “Abinger c. 57”, is over 700 MB with images at half size. Additionally, hosting pre-generated offline-ready resources for download can be costly; this is currently an obstacle, and ultimately a deal breaker, for S-GA to permanently provide downloadable content across the site. Moreover, if high-resolution images are included in downloadable packages, users with slow internet connections or limited data plans may not be able to download the offline content. However, they may still rely on free access points such as public libraries to obtain the data and then use it offline elsewhere. Finally, creating TEI resources that can be used offline mostly requires creating a static site. S-GA uses Jekyll and a focus on client-side technology, but XSL, a much more common technology for TEI work, can also be used to generate fully functional static sites.
Overall, we have learned that creating offline versions of complex TEI-based projects is more feasible than we thought at the beginning of this experiment; yet compromises on the quality of data may be necessary, and preparations for offline use require dedicated labor. It remains to be verified whether implementing offline access can, in fact, increase the availability of digital humanities resources for those with limited access to the Internet; however, we believe it to be a necessary step for further investigation.