5 Jul
2016
5:52 p.m.
Hi James,
Individual OCR'd words on pages make up roughly 85% of the data, and I don't actually need that data. So skipping the OCR'd words at load time might help. I haven't tried that yet, and ideally I'd prefer not to have to.
Some (more or less obvious) questions back:
* How large is the resulting XML document (around 15% of the original document)?
* How do you measure the time?
* Do you store the result on disk?
* How long does your query take if you wrap it into a count(...) or prof:void(...) function?

Thanks in advance,
Christian