I have sent this post simultaneously to the eXist, BaseX, and TEI lists in the hope that I will get some useful advice on how to cobble together an XML-based collaborative curation platform. If there is something "out there" that meets some of my requirements, I would love to hear about it.
Phil Burns, Craig Berry, and I have been engaged in an informal project that I call Shakespeare His Contemporaries (SHC). The source texts for SHC are the TCP transcriptions, transformed into TEI P5 with Abbot and linguistically annotated with MorphAdorner. The goal is to produce "good enough" texts (in original and standardized spellings) through collaborative curation, enlisting the help of educated amateurs as well as professional scholars. A group of Northwestern undergraduates on summer internships fixed about 70% of some 45,000 known errors in almost 500 Early Modern non-Shakespearean plays and demonstrated that you don't need a Ph.D. to make valuable contributions to scholarship.
My curation framework has been a mixture of XML and SQL. MorphAdorner (http://morphadorner.northwestern.edu) can spit out its data as a TEI file or in a tabular format. The latter is easily turned into a MySQL database. AnnoLex, our curation tool (http://annolex.at.northwestern.edu), was built by Craig Berry. It is a Django web site that talks to an underlying MySQL database and lets registered users make emendations, which are kept in a separate table for subsequent review and integration into the XML files. MorphAdorner includes scripts for updating XML files in the light of approved emendations. It also keeps track of the changes made.
This system works well in environments where curators operate in a controlled, work-like setting, cycling through textual phenomena selected by criteria likely to return incomplete or incorrect text. It works well because most of the errors in the TCP transcriptions are of a very simple kind that lends itself to an 'atomic' word-by-word treatment. The combination of Django and a relational database makes it very easy to keep track of who did what, where, and when, and who approved or rejected an emendation. These are non-trivial virtues, and there are many TCP errors that can be fixed by this method before you run up against its limits.
On the other hand, this method does not support what I call "curation en passant": readers reading "with a pencil," or its digital equivalent, stopping here or there to question a word or passage and offering a diagnosis or emendation as a marginal gloss. I would like to have a curation environment that looks like a page and supports reading of the common or garden variety, but that, through a click on a word, shows a pop-up window listing the properties of a token and offering a template for structured and free-form annotation.
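Behind such a pop-up I imagine little more than a lookup of the token by its xml:id, roughly along these lines (a sketch only; the collection name and function are made up for illustration):

  (: gather the properties of a single token for display in a pop-up :)
  declare function local:token-info($id as xs:string) as element(token) {
    let $w := collection("shc")//w[@xml:id = $id]
    return
      <token id="{ $id }" lemma="{ $w/@lemma }" ana="{ $w/@ana }"
             source="{ base-uri($w) }">{ string($w) }</token>
  };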
The current model also fails when it comes to making changes in the XML structure. In terms of the immediate needs of a project that focuses on EEBO-TCP drama texts, the most common errors involve just the correction of a tag name: turning <stage> into <l> or the other way round. There are ways in which these errors could be corrected by extending AnnoLex, but it would be clumsy, and it would not support more substantial changes in the XML structure. So you want a curation environment that lets you manipulate the XML structure as well as the words inside elements. For instance, half the plays in the EEBO-TCP collections are not divided into acts and scenes because such a division is lacking in the printed source. Adding scene divisions (with an appropriate notation that they are supplied by an editor) is an important step towards making the texts more "machine actionable," and a proper digital corpus should be both human readable and machine actionable.
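(I suspect that in an XQuery Update world the simplest of these fixes is close to a one-liner; a sketch only, with a made-up file name and xml:id:

  (: turn a mis-tagged <stage> into an <l>; content and attributes are kept :)
  for $stage in doc("A04632.xml")//stage[@xml:id = "A04632_0500800"]
  return rename node $stage as "l"

What I don't know how to build is the environment around such a fix: who may make it, and how it is reviewed and logged.)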
The best curation environment would be XML based, but I wonder about scale. As I understand it, 'scale' in an XML application is a function of the number of documents, their size, and the complexity of encoding. The SHC corpus has about 500 documents and could grow to 1,000 if one interpreted "contemporaries" more generously or decided to include Shakespeare's Restoration heirs. Most plays fall in a range between 15,000 and 25,000 words. In a linguistically annotated corpus the typical leaf node is a <w> or <pc> element with a set of attributes, such as
<w xml:id="A04632_0500750" lemma="Cleopatra" ana="#npg1">Cleopatraes</w>
The modal XPath of such a leaf node goes like
TEI/text/body/div[@type="act"]/div[@type="scene"]/sp/l/w
There are about 12 million such XPaths in the SHC corpus, and it is important for readers to be able to move quickly from one to another, within and across plays, so as to support what I call "cascading curation," where the diagnosis of an error in one play leads to the question whether there are similar errors elsewhere.
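To make "cascading curation" concrete with a made-up example: if a reader decides that a given form has been mis-transcribed in one play, the natural next step is a corpus-wide query for the same form, something like

  (: find every occurrence of the same surface form across the corpus :)
  for $w in collection("shc")//w[. = "Cleopatraes"]
  return <hit source="{ base-uri($w) }" id="{ $w/@xml:id }" lemma="{ $w/@lemma }"/>

and the speed of that round trip is what I mean by moving quickly from one such path to another.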
So much for scale, and I don't know whether it is a big problem, a small problem, or no longer a problem at all. If you want to extend a curation model from 500 plays to the eventual 70,000 texts in the EEBO-TCP corpus, there might be a scale problem. On the other hand, if you think of collaborative curation in a humanities environment and of scholarly data communities that organize themselves around some body of texts, it may be that something like 1,000 texts or 50 million words is a good enough upper limit: breaking down the big city of a large corpus into "corpus villages" (perhaps with some overlap) may be technically and socially preferable.
I don't have a very clear idea how a permissions regime would work in an XML environment. It is clear to me that a change of any kind should take the form of an annotation and that the integration of such an annotation into the source text would be the work of an editor with special privileges. There needs to be a clear trail of the 'who, what, when, and where' of any textual change, and as much of this detail as possible should be logged automatically. The scholarly communities I'm familiar with are unlikely to accept texts as curated unless there are familiar seals of good housekeeping that they recognize as the equivalent of the assurances that come with a reputable printed text.
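To make that concrete, I picture every proposed change as a small free-standing record, roughly of this shape (element names, attribute names, and values are invented purely for illustration):

  <emendation target="#A04632_0500750" who="#some-curator"
              when="2015-06-01T10:00:00" status="pending">
    <orig>Cleopatraes</orig>
    <corr>Cleopatra's</corr>
    <note>Free-form justification goes here.</note>
  </emendation>

so that the 'who, what, when, and where' travels with the proposed change rather than living only in a log file.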
I could go on in considerable detail about the needs of our project (and similar ones), but I hope I have outlined the requirements in sufficient detail.
Thanks in advance for any advice.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
Hi Martin,
sorry for the delay, and thanks for the summary of your project.
Storing, indexing, and querying gigabytes of XML data should be no major problem (some outdated statistics can be found here [1]; please note that the created databases did not include any index structures). I assume you have already stumbled upon XQuery Full Text, which also allows you to do text-based search [2].
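To give you an impression, a full-text query over your <w> elements might look like this (assuming a database called "shc"; the search term is just an example):

  (: full-text search over the token elements of the whole corpus :)
  for $w in collection("shc")//w[. contains text "cleopatraes" using case insensitive]
  return <hit path="{ db:path($w) }">{ string($w) }</hit>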
Talking about scalability, do you have an approximate guess at the total byte size of the XML documents to be managed? Maybe the easiest thing would be to simply run BaseX and create a first database from an initial collection.
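For a first experiment, something like the following in the BaseX console should be enough (database name and path are placeholders):

  CREATE DB shc /path/to/shc-corpus
  CREATE INDEX FULLTEXT
  INFO DB

The last command prints the main properties of the resulting database, which should give you a feeling for how your corpus behaves at full size.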
It surely gets more interesting and challenging when the original data is to be changed, i.e. when texts are annotated. In this case, I would recommend keeping the original documents untouched and well indexed, and storing changes in an additional database. Node IDs could be used as back references [3], and the updates could be merged back into the original data at regular intervals. As more than one database can be addressed by a single query, original and updated nodes can also be merged on the fly, using XQuery Update [4].
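A rough sketch of what I mean (database names, element names, and the xml:id are placeholders):

  (: 1) record a proposed change in a separate database, pointing back to
        the original token via its pre value :)
  let $w := db:open("shc")//w[@xml:id = "A04632_0500750"]
  return insert node
    <annotation pre="{ db:node-pre($w) }" status="pending">
      <corr>Cleopatra's</corr>
    </annotation>
    into db:open("annotations")/annotations

  (: 2) later, merge all approved changes back into the source database :)
  for $a in db:open("annotations")//annotation[@status = "approved"]
  return replace value of node db:open-pre("shc", xs:integer($a/@pre))
    with $a/corr/string()

Note that pre values only remain stable as long as the original database is not updated, so the merge step is best run as a scheduled, atomic operation.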
Feel free to ask for more details, Christian
[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/Full-Text
[3] http://docs.basex.org/wiki/Database_Module#db:open-pre
[4] http://docs.basex.org/wiki/XQuery_Update