This case study discusses the key decisions in adopting standards and technologies for a digitization project, in dialogue with ongoing scholarship around minimal computing and minimal editions. It has a specific focus on choices that affect long-term preservation and access, including efforts to enable offline use of the archive in order to increase its availability to a larger number of communities with variable access to the Internet.
by Raffaele Viglianti, Research Programmer, Maryland Institute for Technology in the Humanities at University of Maryland
Introduction
The Shelley-Godwin Archive (S-GA) aims to provide the digitized manuscripts of Percy Bysshe Shelley, Mary Wollstonecraft Shelley, William Godwin, and Mary Wollstonecraft, bringing together online the widely dispersed handwritten legacy of this uniquely gifted family of writers. The archive, started in 2013, has published a number of manuscripts, including the ones containing the draft of Mary Shelley’s Frankenstein, the fair copy of Percy Shelley’s Prometheus Unbound, and William Godwin’s Political Justice. The Frankenstein and Prometheus Unbound manuscripts are also accompanied by transcriptions that attempt to faithfully reproduce the writing, acting as a “map” for reading and comprehending the manuscript page. While the holdings of S-GA continue to grow, the technological stack at the core of this online publication is now fairly stable. In this case study, we discuss key decisions in adopting standards and technologies for the project, in dialogue with ongoing scholarship around minimal computing and minimal editions. In particular, we describe the choices that affected the project’s chances for long-term preservation and access. Finally, the case study focuses on exploratory efforts in enabling offline use of the archive in order to increase its availability to a larger number of communities with variable access to the Internet.
The technologies behind S-GA: a short overview
S-GA primarily deals with two kinds of data: the facsimile images of the manuscripts and, when available, their transcriptions. Images are handled through a specification published by the International Image Interoperability Framework (IIIF, pronounced “triple-eye-eff”) Consortium. This specification provides a standardized way of serving and requesting images or specific regions of images according to various parameters such as size, resolution, or rotation. IIIF makes it possible for digital libraries to host and serve images and for projects such as S-GA to request and embed them in their applications. The specification is quickly being adopted by major libraries across the globe. The Oxford Bodleian Library holds a number of manuscripts of interest to S-GA, many of which have been digitized with funding from the archive. The Bodleian Library is now serving these images with IIIF, which makes it possible for S-GA to embed them directly from the library, without needing to host them as part of the project infrastructure. This reduces the infrastructural footprint of the project, which we argue increases its chances for future preservation.
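To make the idea of requesting an image or a region of an image more concrete, the following is a minimal sketch of how an IIIF Image API request URL is assembled from its path segments (identifier, region, size, rotation, quality, and format). The server address and the identifier are hypothetical placeholders, not the Bodleian’s or S-GA’s actual values, and the `max` size keyword assumes a recent version of the Image API (older servers may expect `full` instead).

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path segments."""
    # The identifier is percent-encoded, including any slashes it contains.
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, for illustration only.
BASE = "https://iiif.example.org/iiif/image"
PAGE = "ms-example-0001"

# The whole page, scaled to 800 pixels wide.
full_page = iiif_image_url(BASE, PAGE, size="800,")

# A 1000 x 500 pixel region from the top-left corner, rotated by 90 degrees.
detail = iiif_image_url(BASE, PAGE, region="0,0,1000,500", rotation="90")
```

Because every parameter is part of the URL, a viewer can ask the holding library for exactly the crop or zoom level it needs, without the project storing any derivative images itself.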
The transcriptions are encoded according to the Text Encoding Initiative (TEI) Guidelines. The TEI, expressed using the Extensible Markup Language (XML) technology, is a long-standing standard in the digital humanities and is used globally for text encoding in academic projects. S-GA uses a particular subset of the TEI, recently added to the standard, that focuses on identifying and describing writing areas, their layout on the page, and authorial revisions (such as additions and deletions). Texts encoded in TEI are typically published on the web, which makes it possible to create interactive reading experiences. TEI’s approach also enables the same encoded content to be delivered in other formats and media, such as e-book and PDF for print. Web-based interactive digital editions, however, are the most effective at leveraging TEI’s ability to formalize, in a machine-readable way, scholarship as well as text. In S-GA’s case, for example, “scholarship” encoded in the archive’s TEI includes (but is not limited to) the identification and description of text written in different “hands” (different people or writing media) and the sequence of textual revisions.
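As an illustration of what “machine-readable” means here, the sketch below reads a small TEI fragment and lists its additions and deletions together with the hand they are attributed to. The fragment is invented for this example and is not taken from the archive’s files; the element names (`surface`, `zone`, `line`, `add`, `del`) are standard TEI, but the way S-GA combines them may differ.

```python
import xml.etree.ElementTree as ET

# Invented sample, not copied from the archive's TEI.
SAMPLE = """<surface xmlns="http://www.tei-c.org/ns/1.0">
  <zone type="main">
    <line>the <del rend="strikethrough">wretch</del>
      <add place="above" hand="#pbs">creature</add> fled</line>
  </zone>
</surface>"""


def list_revisions(tei_xml):
    """Return (kind, hand, text) for each addition or deletion, in document order."""
    root = ET.fromstring(tei_xml)
    revisions = []
    for element in root.iter():
        # Strip the namespace part, e.g. "{http://www.tei-c.org/ns/1.0}add" -> "add".
        local_name = element.tag.split("}")[-1]
        if local_name in ("add", "del"):
            text = "".join(element.itertext()).strip()
            revisions.append((local_name, element.get("hand"), text))
    return revisions


revisions = list_revisions(SAMPLE)
```

A viewer can use this kind of information to, for instance, highlight only the text written in a given hand, or to hide deletions and show a reading text.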
Ongoing scholarship around minimal computing and minimal editions has pointed out some important, yet addressable, flaws of many TEI-powered digital editions. Bloatedness of infrastructure is one of them. Particularly when paired with rapid technical obsolescence and changes in funding, it can hamper long-term preservation efforts; weighty resources may not be easily accessible from slower connections, and online-only access to a digital edition can be an obstacle to worldwide access. This bloatedness is often induced by the necessity to transform TEI into HTML to be displayed in web browsers. Because this is a lossy transformation, TEI is often handled through an XML database or other server component able to query and transform it on the fly into the HTML required by a specific view on the text. This may, indeed, often be the best approach depending on the goals of a project and its funding for the present and the future. However, S-GA intentionally experimented with reducing this dependence on transforming TEI into HTML.
Both the image and TEI (text) data are organized using a Shared Canvas “manifest”. Shared Canvas was a technology pioneered at Stanford University to organize and arrange images in sequences and annotate them; each item in a sequence is a “canvas” which can be annotated with multiple images (e.g. photographs of the same manuscript page taken under different types of light) and other multimedia annotations. S-GA uses Shared Canvas to “annotate” images with text from the archive’s TEI documents. A JavaScript viewer, which runs in the user’s web browser, shows the images and the text from the TEI without needing dedicated transformation steps on the server (see Fig. 1). Shared Canvas has now been integrated into an equivalent and improved IIIF specification that deals with the presentation of images and is distinct from the one for serving and requesting images described before. There are now viewers capable of rendering images and annotations following this IIIF specification, but, as will be discussed later, S-GA has yet to transition to the newer IIIF version.
Finally, our JavaScript viewer is built into a static site together with all the other pages (introductions, table of contents, etc.), using a popular static site generator: Jekyll. In its production form, with the exception of its search index, S-GA is a collection of TEI, HTML, CSS, and JavaScript that can be hosted on any server without needing to set up any server-side component. This approach also makes it possible to bundle resources together for offline use, a feature that could substantially improve the usability of the site, and access to it, for a larger number of communities with variable access to the Internet.
Experimenting with offline access
By offline access we mean being able to download and keep a copy of part of the archive, such as a specific manuscript or work, and to view it offline in the browser with the look and features of the site largely unchanged. We suggested earlier that TEI is best published on the web because of the interactivity that the medium enables; therefore, it was a priority to maintain this interactivity in the offline version. The images, transcriptions, and other creative aspects of the site are all openly licensed, which makes it possible for individuals to download, keep, and modify their copy. We tried three different approaches, each with its own advantages and disadvantages.
One document does it all: HTML
In this experiment, all resources for one manuscript (the viewer, images, and TEI) are combined into one HTML document. A JavaScript object maps the link to each resource to that resource’s string representation. In order to be included in the HTML document, images are encoded as base64 strings (base64 is a way of representing binary data in a string format), a common practice on the modern web.
- Pros: The resulting HTML file can simply be “opened” in a browser, including mobile browsers.
- Cons: Images cannot be full size: a base64 representation is roughly 33–37% larger than the original, so we opted to halve their size. Moreover, because they are bundled in the HTML, individual resources are not easily accessible outside of the website.
This same approach to mapping resources could theoretically be employed to store the site in the browser’s “local store”. This would allow users to visit the site again when offline, but they would not “own” the data, so to speak, to re-use it or otherwise manipulate it.
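The resource map used in this experiment can be sketched as a small build step. The example below is an assumption about how such a bundle could be produced, not the script used by S-GA: it encodes each resource as a base64 data URI, writes the link-to-string map into a single HTML document, and measures how much larger the encoded image is. The file names, the `OFFLINE_RESOURCES` variable, and the placeholder bytes are all hypothetical.

```python
import base64
import json


def to_data_uri(data, mime_type):
    """Encode raw bytes as a base64 data URI string."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_bundle(resources):
    """Return one HTML document that embeds every resource as a string.

    `resources` maps the link used by the viewer to a (bytes, MIME type) pair.
    """
    resource_map = {link: to_data_uri(data, mime)
                    for link, (data, mime) in resources.items()}
    # Escape "</" so that no embedded string can close the script element early.
    map_json = json.dumps(resource_map).replace("</", "<\\/")
    return ('<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head><body>\n'
            f"<script>window.OFFLINE_RESOURCES = {map_json};</script>\n"
            "<!-- the viewer script would be inlined here -->\n"
            "</body></html>\n")


# Placeholder bytes standing in for a real image and a real TEI file.
fake_image = bytes(range(256)) * 12  # 3072 bytes
resources = {
    "images/page-0001.jpg": (fake_image, "image/jpeg"),
    "tei/page-0001.xml": (b"<surface/>", "application/xml"),
}

html = build_bundle(resources)

# Base64 emits 4 characters for every 3 bytes, before any line breaks are added.
overhead = len(base64.b64encode(fake_image)) / len(fake_image)
```

In the browser, the viewer would then look up a link in the map instead of fetching it over the network, which is why the resulting file works when opened directly from disk.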
Zip it up
In this experiment, all resources for one manuscript are organized into a Zip archive. This approach does not require any change to the existing static site, unlike the HTML approach which requires a dedicated JavaScript object for mapping resources.
- Pros: The images can be downloaded via IIIF protocols when creating the archive and can be full size. It is possible to include multiple pages such as the table of contents and contextual pages. Moreover, the user gets easy access to all the resources outside of the website.
- Cons: The HTML pages cannot simply be opened in a browser because, in order for the JavaScript viewer to properly access resources (images and TEI) from the computer, the site needs to be made available via a server. While setting up and running a local HTTP server is not too complicated, it can be a major hurdle for those less familiar with how the web works. This approach is also not feasible on most mobile devices.
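For readers wondering what “made available via a server” involves, here is a minimal sketch using only Python’s standard library: it extracts a downloaded Zip archive and serves the resulting folder on the local computer. The function names, the archive name, and the port are illustrative assumptions; any other static file server would work equally well.

```python
import functools
import http.server
import threading
import zipfile


def extract_archive(zip_path, target_dir):
    """Unpack the downloaded Zip archive into `target_dir`."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)


def make_server(directory, port=8000):
    """Create an HTTP server, reachable from this computer only, for `directory`."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=str(directory))
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve_in_background(directory, port=8000):
    """Start serving `directory` in a background thread and return the server."""
    server = make_server(directory, port)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


# Example use (hypothetical file name):
#   extract_archive("frankenstein-offline.zip", "site")
#   make_server("site").serve_forever()
#   ...then open http://127.0.0.1:8000/ in a browser.
```

The same result can be obtained without writing any code by running `python -m http.server 8000` from inside the extracted folder, but even this single command presupposes a Python installation and some familiarity with the command line.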
Electron
Electron is an open source technology to build cross-platform desktop applications using web technologies. In this experiment, we use it to wrap the same content needed for the Zip archive version and compile it as applications for Windows, Mac and Linux.
- Pros: Similarly to the Zip archive, images and other resources can be included in full. The applications are installed on the user’s computer and can be customized to be given a more official look.
- Cons: There is currently no support for mobile devices and no easy way to access resources outside of the application. Finally, the resulting applications tend to be markedly larger than the site by itself.
Accessible and offline futures for TEI projects
The “one document does it all” approach of bundling everything in one HTML file is the most versatile and mobile friendly; the latter advantage is critical for enabling more access in the Global South. Unfortunately, images cannot be too large in size. This may nonetheless be a very desirable option for digital editions that do not include facsimile images, but it is problematic for the Shelley-Godwin Archive. High quality images, particularly when prepared to be explored at different zoom levels, take a lot of space. For example, one of the Frankenstein manuscripts, “Abinger c. 57”, is over 700MB with images at half size. Additionally, hosting pre-generated offline-ready resources for download can be costly; this is currently an obstacle, and ultimately a deal breaker, for S-GA to permanently provide downloadable content across the site. Moreover, if high-resolution images are included in downloadable packages, users with slow internet connections or limited data plans may not be able to download the offline content. However, they may still rely on free access points such as public libraries to obtain the data and then use it offline elsewhere. Finally, creating TEI resources that can be used offline mostly requires creating a static site. S-GA uses Jekyll and focuses on client-side technology, but XSL, a much more common technology for TEI work, can also be used to generate fully functional static sites.
Even though S-GA will likely not be able to permanently provide downloadable content, we have learned important lessons for our future TEI projects. Simplifying the technological infrastructure of TEI projects is advantageous to both creators and end users. Our experience with rendering TEI without server-side components prompted S-GA technical editor Raffaele Viglianti to collaborate with Hugh Cayless, who reached similar conclusions from his work at the Duke Collaboratory for Classics Computing. This collaboration resulted in the creation of CETEIcean (pronounced “su-TAY-tion”, a play on the word cetacean), a JavaScript library for rendering TEI in the browser without modification, which is currently gaining ground in TEI teaching and project development.
Finally, we have learned that creating offline versions of complex TEI-based projects is more feasible than we thought at the beginning of this experiment; yet, compromises on quality of data may be necessary and preparations for offline use require dedicated labor. It remains to be verified whether implementing offline access can, in fact, increase the availability of digital humanities resources for those with limited access to the Internet; however, we believe it to be a necessary step for further investigation.