Hi - I'm wondering what sort of recommendations anyone has for making my queries "go faster".
I created a database by loading 21GB of XML data - representing the OCR'd contents of 5000 books - into a new BaseX 8.4.4 instance.
I ran OPTIMIZE after the import.
In the BaseX GUI, Properties > Information says it's sized at 40GB and contains around 1.5 billion nodes; there are no binaries.
The sorts of query I'm interested in running are those that search each article on each page of each book - for example:
for $book in //book
return
  <result>
    <book id="{$book/id/text()}"/>
    {
      for $page in $book/page
      return
        <page id="{$page/id/text()}">
        {
          for $article in $page/article
          return <article id="{$article/id/text()}"/>
        }
        </page>
    }
  </result>
On a reasonably powerful i5 laptop with an SSD and plenty of RAM, this query takes around 148,550 ms (roughly two and a half minutes) - I'd like to significantly reduce that.
Individual OCR'd words on pages maybe comprise around 85% of the data - and I don't actually care about this data. So maybe if I just don't load these OCR'd words it will help? I haven't tried that yet, but ideally I'd like not to have to do it.
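If it did come to that, I suppose one option would be to load everything anyway and then delete the word elements afterwards with XQuery Update - an untested sketch, where 'books' stands in for my database name:

  (: untested sketch - remove every OCR'd word element from the database :)
  delete nodes db:open('books')//wd

followed by another OPTIMIZE so the statistics reflect the smaller database.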
So, if anyone has any tips or ideas for reducing the query time then I'd be very interested in hearing what you have to say.
Hi James,
> Individual OCR'd words on pages maybe comprise around 85% of the data - and I don't actually care about this data. So maybe if I just don't load these OCR'd words it will help? I haven't tried that yet, but ideally I'd like not to have to do it.
Some (more or less obvious) questions back:
* How large is the resulting XML document (around 15% of the original document)?
* How do you measure the time?
* Do you store the result on disk?
* How long does your query take if you wrap it into a count(...) or prof:void(...) function?
Thanks in advance, Christian
Hi Christian - I've built a new database, using the same data except that this time I stripped out the OCR'd word elements (called <wd/>).
My estimate of the <wd/> elements representing 85% of the data was wrong: they represent 96.5% of the data. This means the database files have shrunk from 40GB to 1.5GB.
Instead of the database having ~1.5 billion nodes it now has ~78 million.
Reducing the problem space means the following XQuery - run in the BaseX 8.5 GUI - has gone from an average of 148,000 ms to 3,900 ms:
let $start := prof:current-ns()
let $void := prof:void(
  for $book in //book
  return
    <result>
      <book id="{$book/id/text()}"/>
      {
        for $page in $book/page
        return
          <page id="{$page/id/text()}">
          {
            for $article in $page/article
            return <article id="{$article/id/text()}"/>
          }
          </page>
      }
    </result>
)
let $end := prof:current-ns()
let $ms := ($end - $start) div 1000000
return $ms || ' ms'
This is good news. However, doesn't this show an issue in how BaseX maintains its indexes? What I mean is that the <wd/> elements are two levels below each <article/> - i.e. <article><p><wd/></p></article>. If my XQuery doesn't care about the <wd/> and <p/> elements, why is it still affected by them?
Thanks.
Hi James
> Reducing the problem space means the following XQuery - run in the BaseX 8.5 GUI - has gone from an average of 148,000 ms to 3,900 ms:
…sounds good. And thanks for the details.
> However, doesn't this show an issue in how BaseX maintains its indexes?
BaseX has no inverted indexes for element and attribute nodes (see [1] for more information on our index structures). In most applications this works pretty well, because we’ve spent a lot of effort on speeding up sequential access to document structures; but with 40 GB of data, too much of it ends up being moved temporarily into main memory.
Hope this helps, Christian
On Wed, 2016-07-06 at 16:02 +0100, James Sears wrote:
> let $void := prof:void(for $book in //book
Are you able to make that //book more specific? Or can book elements occur at any level? BaseX is possibly fetching every element node in the database to see whether or not it's a book element.
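For example - hypothetically, since I don't know your document structure - if book elements only ever sit directly under a single books root element, a rooted path avoids scanning everything else:

  (: hypothetical - assumes documents look like <books><book>...</book></books> :)
  for $book in /books/book
  return <book id="{$book/id/text()}"/>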
Or can you look at the generated query plan in more detail?
Liam
Hi Liam - at the moment, I'm expecting our most common use case will be to search across all the books. It's not possible to be more specific until later on in the query.
In fact, it's likely that specific filtering would be done via an export of the XML that gets imported into Excel - where standard filters would be applied by the (non-technical) user.
Hi James,
did you see: http://docs.basex.org/wiki/Index#Selective_Indexing
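With selective indexing you can restrict the value indexes to the names you actually query - roughly like this at database creation time (the included names here are just placeholders):

  SET TEXTINCLUDE id
  SET ATTRINCLUDE id
  CREATE DB books /path/to/input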
Otherwise, for large amounts of data, I found it helpful to create another database with only the contents that need to be full-text searchable or that otherwise help to limit the scope of the data. Then, with more specific information (e.g. title or category names), fetch the data from the main database.
As well, databases in BaseX are really lightweight, so you could create multiple databases and spread the 40 GB over several of them. Then query multiple databases at once - that could be faster than querying one large database (and updating is much faster as well).
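Fanning a query out over several databases is straightforward - a sketch with made-up database names:

  (: query several partition databases in one go :)
  for $db in ('books-part1', 'books-part2', 'books-part3')
  for $book in db:open($db)//book
  return <book id="{$book/id/text()}"/>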
Regards,
Max
Hi Max - you have given me food for thought.
I was not aware of selective indexing and will investigate.
Partitioning might well be the answer - I'll ask around and see if the people close to the data have some ideas about how it might be logically partitioned. Maybe having a master database - without the <wd/> elements - "pointing" at various database partitions is the answer?
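To make that concrete - a purely hypothetical sketch, where the master database stores a @db attribute naming the partition that holds each book's full content:

  (: look up a book in the small master db, then fetch the full
     record from whichever partition the master entry points at :)
  let $id := '12345'
  let $entry := db:open('master')//book[id = $id]
  return db:open(string($entry/@db))//book[id = $id]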
Thanks for the ideas.
Hi James,
if you are able to post some more information about your data model, I am sure this list can come up with more helpful pointers.
It is always about trade-offs, e.g. flexible data model vs performance.
Regards,
Max