Hi Sandra,
[snip]
Some questions, just out of curiosity: - how much XML data (MB/GB) do you currently work with?
Currently my largest db looks like this:

  Size: 4012 MB
  Nodes: 61552395
  Height: 8
  Input Size: 913 MB
  Encoding: UTF-8
  Documents: 30
  Whitespace Chopping: ON
  Entity Parsing: OFF
  + All indexes.
It's kind of a "pigpen" experimental database - no design at all. I am currently putting together a smaller, properly designed "production" db for online public queries:
  Size: 1067 MB
  Nodes: 36282594
  Documents: 23
I would anticipate the volume of data for researchers growing to 2-4 times that size over the next few years, with less growth on the public database.
- how much time does BaseX need in your context for update operations and the optimize command?
I've barely touched XQuery Update. I load data from files as needed and add/delete documents as required. Optimising the indexes takes a few minutes; the big problem is that online queries stop while this happens. We are transitioning from dev to production, so I have to solve this problem. My plan is to do nightly updates/rebuilds/reindexing on a separate VM (for the big updates), push the BaseXData/db dir across via rsync, and then stop/restart the query BaseX server with the new database in place of the old (see the sketch below). That should not cause any disruption to users. For the small volume of daily updates I will use XQuery Update against a separate small database which won't need indexes. Hence my excitement at being able to integrate these little changes into the results from the larger online query database.
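In case a concrete outline helps, here is a minimal sketch of the nightly swap I have in mind, driven from Python. The host name, database name and data directory paths are assumptions for illustration, not our actual setup, and it assumes passwordless rsync/SSH between the machines:

    #!/usr/bin/env python3
    # Minimal sketch of the nightly "rebuild elsewhere, then swap" idea.
    # Assumes the database has already been rebuilt and optimised on the build VM.
    import subprocess

    BUILD_HOST = "build-vm"          # hypothetical build VM
    DB_NAME = "production"           # hypothetical database name
    REMOTE_DB = BUILD_HOST + ":/srv/basex/data/" + DB_NAME + "/"  # assumed data dir
    LOCAL_DB = "/srv/basex/data/" + DB_NAME + "/"                 # assumed data dir

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Pull the freshly rebuilt and optimised database directory from the build VM.
    run(["rsync", "-a", "--delete", REMOTE_DB, LOCAL_DB])

    # 2. Restart the query server so it serves the new files.
    run(["basexserver", "stop"])       # standard BaseX launch script
    subprocess.Popen(["basexserver"])  # leave the restarted server running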
- have you already encountered performance limits in everyday use,
Performance is very, very good with well-written online queries. Within a few weeks our search interface will go public and you can see for yourself -- I will let you know.
However, I have encountered rather bad problems with long, complex queries that produce a lot of output. With BaseX these jobs would just kind of "hang". As I recall, the issue was with its serialiser -- no output at all until the query completed, so I was running out of RAM. Saxon's serialisation proved much better for this kind of work because I could track the progress of these jobs by tailing the output as they ran. My apologies for not reporting this at the time (2-3 months back); I should have done so. Since then I have noticed some BaseX changes in the serialiser options, but I have not tried again to see whether the problem is solved. It was very unfortunate, because it meant I could not use the fuzzy/ft querying in these large person-name matching jobs, though I do make that facility available to our researchers in online queries.
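For context, this is roughly the kind of fuzzy full-text query I mean, run here through the official BaseX Python client; the database name, element names and the sample spelling variant are made up for illustration:

    from BaseXClient import BaseXClient  # official BaseX Python client module

    # Connect to a running BaseX server (default port and credentials shown;
    # adjust for the real setup).
    session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
    try:
        session.execute("open persons")  # hypothetical database name
        # Fuzzy full-text match on person names, to catch spelling variants
        # such as "Jhonson" vs "Johnson".
        result = session.execute(
            'xquery //person/name[text() contains text "Jhonson" using fuzzy]'
        )
        print(result)
    finally:
        session.close()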
I am struggling with a very strange memory bug right now when loading, but I suspect it could be an issue with my Perl client -- let me report that to you separately from this reply.
or do you rather try to prevent potential bottlenecks in the future?
I like to prevent!
In another real-life scenario that might be similar to yours, BaseX is used as the backend for a library database with 2 million titles (~1 GB of XML data). The process of updating the data and recreating the indexes is applied once a day/night and takes approx. 2 minutes.
This also aligns with my experience. A complete reload/index build takes about 3 minutes for us on an 8 GB VM on a Dell server. I give the JVM ~4 GB via -Xms and -Xmx. Not terribly painful, but as I mention above, I will need to pull a few tricks once we are running a public online search -- a 3-minute outage for updating is unacceptable.
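For what it's worth, one way to pin the heap is to launch the server main class directly with explicit JVM flags rather than editing the basexserver script; the install path below is an assumption:

    # Sketch of starting the BaseX server with an explicit 4 GB heap.
    # -Xms/-Xmx are standard JVM options; the jar location is an assumption.
    import subprocess

    subprocess.Popen([
        "java", "-Xms4g", "-Xmx4g",
        "-cp", "/opt/basex/BaseX.jar",    # assumed install location
        "org.basex.BaseXServer",          # BaseX server main class
    ])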
...and thanks for always giving instructive feedback!
Hope this helps.
Sandra