TEI: Scholarly Publishers Collaborate on XML

Any university press considering an XML-based workflow for monographs (whether from start to finish or as an archival format) has likely discovered that the first question may also be the knottiest: what kind of XML? Or to put it in more technically accurate terms, which XML language? The answer is far from obvious.

The book markup language developed by the Association of American Publishers as long ago as the 1980s (originally in SGML, the ancestor of XML) is an international standard, ISO 12083, but to our knowledge it has been adopted by no university press other than California, and even there it required extensive modification. DocBook is well established as an authoring and archival language for books and serves publishers like O’Reilly as a natural format for “one source, many outputs” workflows, but it is highly optimized for technical documentation and lacks native markup elements for many structural features common in humanities and social science texts. (The University of Michigan Press has nonetheless adopted it for production of some of its monograph titles.) EPUB/XHTML is perfectly suited to its purpose of encoding books for presentation on a wide variety of mobile devices, but its relatively impoverished set of structural and semantic tags may limit its value as an archival format for scholarly works.

An alternative increasingly being investigated is the markup language developed by the Text Encoding Initiative, or TEI, designed for the ambitious goal of creating machine-readable versions of texts in virtually any genre, from any historical period, and in any natural language. Following organizational work in the late 1980s, the first version of the TEI Guidelines was released in 1990, and was quickly adopted as the markup standard for a wide array of projects housed within university libraries and research departments engaged in digitizing books, manuscripts, drama, correspondence, and even mixed collections of text and images. Today there are literally thousands of texts encoded in TEI and in many cases published via the Web, often accompanied by a variety of full-text and data search tools (see http://www.tei-c.org/Activities/Projects/ for a list of over 100 such sites). The TEI Guidelines are actively maintained and developed by the TEI Consortium, with an international group of directors and editors from a variety of scholarly and professional backgrounds.

Clearly TEI-XML can be used to produce archival machine-readable versions of published books; existing off-the-shelf tools can be used to convert those files to HTML, PDF, and EPUB, although achieving results satisfactory to a professional publisher will usually require more or less customization. But is TEI-XML a viable answer to the XML workflow question? Can a publisher develop in-house procedures for converting existing books to an archival TEI format, or find a vendor capable of doing so? Alternatively, is it feasible to insert TEI-XML into the authoring workflow, so that it becomes the underlying source of both print and digital versions of a book? Over the past year or so, members of both the TEI and the university press communities have been meeting online and in person to address such questions.

The TEI Guidelines in their current form (version “P5”) are extremely rich and comprehensive (over 1,400 pages in PDF form!), so approaching them can be quite daunting. The TEI Special Interest Groups (SIGs) were created to allow individuals to share ideas and develop much more focused uses of TEI. For the most part, these SIGs have been based in the academy and centered on humanities scholarship, but they are open to anyone. The Scholarly Publishing SIG was created in June 2009 to explore the use of TEI in original scholarly publication. One of its aims is to make TEI an attractive choice for presses deciding which XML language to use. Adopting XML is a costly investment that demands considerable time and resources. The university press community needs to collaborate on this front, and this SIG can serve as the starting point for progress. It will enable presses interested in using TEI to share developments with peer institutions as well as with the wider TEI community.

It is quite common to hear that TEI is a standard that is not implemented in any standard way. To address this, the SIG maintains a wiki page with a section on recommended practices. This document is still in its early stages, but its purpose is to create, through a collaborative process, a set of encoding guidelines that presses can use in either XML-first or XML-last workflows. These guidelines could be used for in-house composition, or they could be supplied to encoding vendors for conversion after print publication. If enough presses adopt these guidelines, they could establish common encoding practices and offer advantages when approaching vendors for XML encoding work, in much the same way that TEI Tite is being developed. These guidelines may lead to a specific customization of TEI for publishing, across books and journals.

The SIG will also focus on the XML workflow itself and on the tools such a workflow requires. A round-trip transformation between Microsoft Word and TEI already exists and could be improved through real-world use cases. Similarly, there are transformations from TEI to HTML and to EPUB. These need to be investigated and refined as well.
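
To make the idea of a TEI-to-HTML transformation concrete, here is a minimal sketch in Python using only the standard library. It is not the existing TEI stylesheets or any tool the SIG maintains. The element mapping (TAG_MAP), the sample document, and the function names are all assumptions made up for illustration, and a real press would define its own mapping and handle far more of the TEI vocabulary.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # the TEI P5 namespace

# Hypothetical mapping from TEI element names to HTML element names.
TAG_MAP = {"body": "article", "div": "section", "head": "h2", "p": "p", "hi": "em"}

# Tiny made-up sample; real files would also carry a full teiHeader.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div>
      <head>Chapter One</head>
      <p>Scholarly books use <hi rend="italic">many</hi> structures.</p>
    </div>
  </body></text>
</TEI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]


def to_html(tei_elem):
    """Recursively copy a TEI element into an HTML element tree."""
    name = local_name(tei_elem.tag)
    html = ET.Element(TAG_MAP.get(name, "span"))
    if name not in TAG_MAP:
        # Unmapped elements survive as <span class="tei-..."> so nothing is lost.
        html.set("class", "tei-" + name)
    html.text = tei_elem.text
    for child in tei_elem:
        html_child = to_html(child)
        html_child.tail = child.tail  # keep the text that follows the child
        html.append(html_child)
    return html


def tei_to_html(tei_source):
    """Convert the <body> of a TEI document string to an HTML string."""
    root = ET.fromstring(tei_source)
    body = root.find(f".//{{{TEI_NS}}}body")
    return ET.tostring(to_html(body), encoding="unicode")


if __name__ == "__main__":
    print(tei_to_html(SAMPLE_TEI))
```

Even this toy version shows where the customization effort goes: every structural feature a press cares about (notes, block quotations, tables, figures) needs its own deliberate mapping, which is exactly the kind of decision that shared guidelines could settle once for many presses.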

Another benefit that could be derived from a collaborative effort among university presses is the creation of a set of quality control rules using the rule-based validation language Schematron. Having well-formed and valid XML is only the first step—the XML needs to be checked with the same care and attention given to the print version. Having high-quality XML for use as the archival format for our content is vital. Presses need to be assured of this quality when they use the XML version to generate other formats, such as HTML or EPUB—or even for later editions in print. Creating a set of rules that every press can use to test their content would greatly aid in this effort.
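
The kind of check meant here goes beyond what a schema can express. The sketch below imitates a few Schematron-style rules in plain Python rather than in Schematron itself, simply to show the idea. The three rules, the sample document, and the function name check_quality are assumptions invented for this illustration, not rules adopted by the SIG or by any press.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"                # TEI P5 namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

# Made-up sample with three deliberate problems: an empty paragraph,
# a division with no heading, and a reference to a missing note (#n9).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div>
      <head>Chapter One</head>
      <p>See the note.<ref target="#n1">1</ref> And another.<ref target="#n9">2</ref></p>
      <p> </p>
    </div>
    <div>
      <p>A section with no heading.</p>
      <note xml:id="n1">First note.</note>
    </div>
  </body></text>
</TEI>"""


def check_quality(tei_source):
    """Return a list of human-readable problems found in a TEI document string."""
    root = ET.fromstring(tei_source)
    problems = []

    # Collect every xml:id so internal references can be resolved.
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}

    # Hypothetical rule 1: every <div> should have a <head> child.
    for div in root.iter(TEI + "div"):
        if div.find(TEI + "head") is None:
            problems.append("div without a head")

    # Hypothetical rule 2: no paragraph should be empty of text.
    # (A real rule would also allow paragraphs holding only a figure.)
    for p in root.iter(TEI + "p"):
        if not "".join(p.itertext()).strip():
            problems.append("empty paragraph")

    # Hypothetical rule 3: every internal <ref target="#..."> must resolve.
    for ref in root.iter(TEI + "ref"):
        target = ref.get("target", "")
        if target.startswith("#") and target[1:] not in ids:
            problems.append(f"ref target {target} does not match any xml:id")

    return problems


if __name__ == "__main__":
    for problem in check_quality(SAMPLE):
        print(problem)
```

A document like the sample can be well-formed and still carry all three problems, which is why such rules are worth sharing. In practice the same checks would more naturally be written as Schematron assertions, so that any press could run them with standard tools.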

A symposium was held at the Digital Humanities Observatory in Dublin on 28 April 2010 (see http://dho.ie/node/673) to discuss the growing interest in the use of TEI in scholarly publishing. The TEI community discussed its possible emerging role in scholarly communication and publishing. While the symposium ended with the question very much open, it was clear that coordination of work through the SIG was required. The TEI community has yet to decide whether it should focus its energies on tool development in this area, or on a specific customization of TEI for publishing, or even whether it should engage in the publishing process directly. The university press community should take this moment to work together with the TEI community in order to make the transition to digital publishing.

David Sewell, Editorial and Technical Manager, ROTUNDA, University of Virginia Press
Kenneth Reed, Digital Production Specialist, The University of North Carolina Press