Hi Christian - I've built a new database, using the same data except that this time I stripped out the OCR'd word elements (called <wd/>).

My estimate that the <wd/> elements represent 85% of the data was wrong: they represent 96.5% of it. As a result, the database files have shrunk from 40GB to 1.5GB.

Instead of the database having ~1.5 billion nodes it now has ~78 million.
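
As a rough cross-check, the rounded figures above can be turned into percentages. This is only a sketch: the inputs are the approximate sizes and node counts quoted in this message, not exact measurements, so the results will not match the 96.5% figure exactly:

(: Inputs are the rounded figures from this message, not exact values. :)
let $size-reduction := 1 - 1.5 div 40     (: GB after / GB before :)
let $node-reduction := 1 - 78 div 1500    (: millions of nodes after / before :)
return (
  'size reduction: ' || format-number($size-reduction, '0.0%'),
  'node reduction: ' || format-number($node-reduction, '0.0%')
)

That works out to roughly 96% of the file size and roughly 95% of the nodes.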

Reducing the problem space means that the following XQuery, run in basexgui 8.5, has gone from an average of 148000 ms to 3900 ms:

(: Record the start time in nanoseconds. :)
let $start := prof:current-ns()
(: prof:void evaluates the expression and discards the result. :)
let $void := prof:void(
  for $book in //book
  return
    <result>
      <book id="{$book/id/text()}"/>
      {
        for $page in $book/page
        return
          <page id="{$page/id/text()}">
          {
            for $article in $page/article
            return <article id="{$article/id/text()}"/>
          }
          </page>
      }
    </result>
)
let $end := prof:current-ns()
(: Convert nanoseconds to milliseconds. :)
let $ms := ($end - $start) div 1000000
return $ms || ' ms'
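
For scale, the ratio of the two average timings can be computed the same way. This is a sketch that uses only the two averages quoted above; the variable names are just for illustration:

(: Average timings in milliseconds, taken from the figures above. :)
let $before := 148000
let $after  := 3900
return 'speedup factor: ' || round($before div $after, 1)

That is a speedup of roughly 38 times.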

This is good news. However, doesn't this point to an issue in how BaseX maintains its indexes? What I mean is that the <wd/> elements sit two levels below each <article/>, i.e. <article/><p/><wd/>. If my XQuery never touches the <wd/> and <p/> elements, why is it still affected by them?
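
To make the nesting concrete, here is a minimal sketch of the structure as described. The id values and the word text are made up for illustration, and the real documents contain far more elements at every level:

(: Hypothetical sample; only the element names come from the data. :)
<book>
  <id>b1</id>
  <page>
    <id>p1</id>
    <article>
      <id>a1</id>
      <p>
        <wd>word</wd>
      </p>
    </article>
  </page>
</book>

The query above only follows book/page/article and reads the <id/> children; it never steps into <p/> or <wd/>.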

Thanks.

> From: christian.gruen@gmail.com
> Date: Tue, 5 Jul 2016 17:52:40 +0200
> Subject: Re: [basex-talk] Improving performance in a 40GB database
> To: james.hn.sears@outlook.com
> CC: basex-talk@mailman.uni-konstanz.de
>
> Hi James,
>
> > Individual OCR'd words on pages maybe comprise around 85% of the data - and I don't actually care about this data. So maybe if I just don't load these OCR'd words it will help? I haven't tried that yet, but ideally I'd like not to have to do it.
>
> Some (more or less obvious) questions back:
>
> * How large is the resulting XML document (around 15% of the original document)?
> * How do you measure the time?
> * Do you store the result on disk?
> * How long does your query take if you wrap it into a count(...) or
> prof:void(...) function?
>
> Thanks in advance,
> Christian