I have sent this post simultaneously to the eXist, BaseX, and TEI lists in the hope that I will get some useful advice on how to cobble together an XML-based collaborative curation platform. If there is something "out there" that meets some of my requirements, I would love to hear about it.
Phil Burns, Craig Berry, and I have been engaged in an informal project that I call Shakespeare His Contemporaries (SHC). The source texts for SHC are the TCP transcriptions, transformed into TEI P5 with Abbot and linguistically annotated with MorphAdorner. The goal is to produce "good enough" texts (in original and standardized spellings) through collaborative curation, enlisting the help of educated amateurs as well as professional scholars. A group of Northwestern undergraduates on summer internships fixed about 70% of some 45,000 known errors in almost 500 Early Modern non-Shakespearean plays and demonstrated that you don't need a Ph.D. to make valuable contributions to scholarship.
My curation framework has been a mixture of XML and SQL. MorphAdorner (http://morphadorner.northwestern.edu) can spit out its data as a TEI file or in a tabular format. The latter is easily turned into a MySQL database. AnnoLex, our curation tool (http://annolex.at.northwestern.edu), was built by Craig Berry. It is a Django web site that talks to an underlying MySQL database and lets registered users make emendations, which are kept in a separate table for subsequent review and integration into the XML files. MorphAdorner includes scripts for updating XML files in the light of approved emendations. It also keeps track of the changes made.
This system works well in environments where curators operate in a controlled, work-like setting, cycling through textual phenomena selected by criteria likely to return incomplete or incorrect text. It works well because most of the errors in the TCP transcriptions are of a very simple kind that lends itself to an 'atomic' word-by-word treatment. The combination of Django and a relational database makes it very easy to keep track of who did what, where, and when, and who approved or rejected an emendation. These are non-trivial virtues, and there are many TCP errors that can be fixed by this method before you run up against its limits.
On the other hand, this method does not support what I call "curation en passant": readers reading "with a pencil," or its digital equivalent, stopping here or there to question a word or passage and offering a diagnosis or emendation as a marginal gloss. I would like to have a curation environment that looks like a page and supports reading of the common or garden variety, but that, through a click on a word, shows a pop-up window listing the properties of a token and offering a template for structured and free-form annotation.
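Behind such a pop-up I imagine little more than a lookup of the token by its xml:id, roughly along these lines (a sketch only; the collection name and function are made up for illustration):

  (: gather the properties of a single token for display in a pop-up :)
  declare function local:token-info($id as xs:string) as element(token) {
    let $w := collection("shc")//w[@xml:id = $id]
    return
      <token id="{ $id }" lemma="{ $w/@lemma }" ana="{ $w/@ana }"
             source="{ base-uri($w) }">{ string($w) }</token>
  };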
The current model also fails when it comes to making changes in the XML structure. In terms of the immediate needs of a project that focuses on EEBO-TCP drama texts, the most common errors involve just the correction of a tag name: turning <stage> into <l> or the other way round. There are ways in which these errors could be corrected by extending AnnoLex, but it would be clumsy, and it would not support more substantial changes in the XML structure. So you want a curation environment that lets you manipulate the XML structure as well as the words inside elements. For instance, half the plays in the EEBO-TCP collections are not divided into acts and scenes because such a division is lacking in the printed source. Adding scene divisions (with an appropriate notation that they are supplied by an editor) is an important step towards making the texts more "machine actionable," and a proper digital corpus should be both human readable and machine actionable.
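(I suspect that in an XQuery Update world the simplest of these fixes is close to a one-liner; a sketch only, with a made-up file name and xml:id:

  (: turn a mis-tagged <stage> into an <l>; content and attributes are kept :)
  for $stage in doc("A04632.xml")//stage[@xml:id = "A04632_0500800"]
  return rename node $stage as "l"

What I don't know how to build is the environment around such a fix: who may make it, and how it is reviewed and logged.)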
The best curation environment would be XML based, but I wonder about scale. As I understand it, 'scale' in an XML application is a function of the number of documents, their size, and the complexity of encoding. The SHC corpus has about 500 documents and could grow to 1,000 if one interpreted "contemporaries" more generously or decided to include Shakespeare's Restoration heirs. Most plays fall in a range between 15,000 and 25,000 words. In a linguistically annotated corpus the typical leaf node is a <w> or <pc> element with a set of attributes, such as
<w xml:id="A04632_0500750" lemma="Cleopatra" ana="#npg1">Cleopatraes</w>
The modal XPath of such a leaf node goes like
TEI/text/body/div[@type="act"]/div[@type="scene"]/sp/l/w
There are about 12 million such XPaths in the SHC corpus, and it is important for readers to be able to move quickly from one to another, within and across plays, so as to support what I call "cascading curation," where the diagnosis of an error in one play leads to the question whether there are similar errors elsewhere.
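To make "cascading curation" concrete with a made-up example: if a reader decides that a given form has been mis-transcribed in one play, the natural next step is a corpus-wide query for the same form, something like

  (: find every occurrence of the same surface form across the corpus :)
  for $w in collection("shc")//w[. = "Cleopatraes"]
  return <hit source="{ base-uri($w) }" id="{ $w/@xml:id }" lemma="{ $w/@lemma }"/>

and the speed of that round trip is what I mean by moving quickly from one such path to another.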
So much for scale, and I don't know whether it is a big problem, a small problem, or no longer a problem at all. If you want to extend a curation model from 500 plays to the eventual 70,000 texts in the EEBO-TCP corpus, there might be a scale problem. On the other hand, if you think of collaborative curation in a humanities environment and of scholarly data communities that organize themselves around some body of texts, it may be that something like 1,000 texts or 50 million words is a good enough upper limit: breaking down the big city of a large corpus into "corpus villages" (perhaps with some overlap) may be technically and socially preferable.
I don't have a very clear idea how a permissions regime would work in an XML environment. It is clear to me that a change of any kind should take the form of an annotation and that the integration of such an annotation into the source text would be the work of an editor with special privileges. There needs to be a clear trail of the 'who, what, when, and where' of any textual change, and as much of this detail as possible should be logged automatically. The scholarly communities I'm familiar with are unlikely to accept texts as curated unless there are familiar seals of good housekeeping that they recognize as the equivalent of the assurances that come with a reputable printed text.
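To make that concrete, I picture every proposed change as a small free-standing record, roughly of this shape (element names, attribute names, and values are invented purely for illustration):

  <emendation target="#A04632_0500750" who="#some-curator"
              when="2015-06-01T10:00:00" status="pending">
    <orig>Cleopatraes</orig>
    <corr>Cleopatra's</corr>
    <note>Free-form justification goes here.</note>
  </emendation>

so that the 'who, what, when, and where' travels with the proposed change rather than living only in a log file.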
I could go on in considerable detail about the needs of our project (and similar ones), but I hope I have outlined the requirements in sufficient detail.
Thanks in advance for any advice.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
Hi Martin,
sorry for the delay, and thanks for the summary of your project.
Storing, indexing, and querying gigabytes of XML data should be no major problem (some outdated statistics can be found here [1]; please note that the created databases did not include any index structures). I assume you have already stumbled upon XQuery Full Text, which also allows you to do text-based search [2].
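To give you an impression, a full-text query over your <w> elements might look like this (assuming a database called "shc"; the search term is just an example):

  (: full-text search over the token elements of the whole corpus :)
  for $w in collection("shc")//w[. contains text "cleopatraes" using case insensitive]
  return <hit path="{ db:path($w) }">{ string($w) }</hit>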
Talking about scalability, do you have an approximate guess at the total byte size of the XML documents to be managed? Maybe the easiest thing would be to simply run BaseX and create a first database from an initial collection.
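For a first experiment, something like the following in the BaseX console should be enough (database name and path are placeholders):

  CREATE DB shc /path/to/shc-corpus
  CREATE INDEX FULLTEXT
  INFO DB

The last command prints the main properties of the resulting database, which should give you a feeling for how your corpus behaves at full size.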
It surely gets more interesting and challenging when the original data is to be changed, i.e. when texts are annotated. In this case, I would recommend keeping the original documents untouched and well indexed, and storing changes in an additional database. Node IDs could be used as back references [3], and the updates could be merged back into the original data at regular intervals. As more than one database can be addressed by a single query, original and updated nodes can also be merged on the fly, using XQuery Update [4].
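A rough sketch of what I mean (database names, element names, and the xml:id are placeholders):

  (: 1) record a proposed change in a separate database, pointing back to
        the original token via its pre value :)
  let $w := db:open("shc")//w[@xml:id = "A04632_0500750"]
  return insert node
    <annotation pre="{ db:node-pre($w) }" status="pending">
      <corr>Cleopatra's</corr>
    </annotation>
    into db:open("annotations")/annotations

  (: 2) later, merge all approved changes back into the source database :)
  for $a in db:open("annotations")//annotation[@status = "approved"]
  return replace value of node db:open-pre("shc", xs:integer($a/@pre))
    with $a/corr/string()

Note that pre values only remain stable as long as the original database is not updated, so the merge step is best run as a scheduled, atomic operation.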
Feel free to ask for more details, Christian
[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/Full-Text
[3] http://docs.basex.org/wiki/Database_Module#db:open-pre
[4] http://docs.basex.org/wiki/XQuery_Update