Hi James,
The individual OCR'd words on the pages make up roughly 85% of the data, and I don't actually care about this data. So maybe it would help if I simply didn't load these OCR'd words? I haven't tried that yet, but ideally I'd rather not have to.
Some (more or less obvious) questions back:
* How large is the resulting XML document (around 15% of the original document)?
* How do you measure the time?
* Do you store the result on disk?
* How long does your query take if you wrap it into a count(...) or prof:void(...) function?
Thanks in advance, Christian