Hi Sandra,
[snip]
Some questions, just out of curiosity: - how much XML data (MB/GB) do you currently work with?
Currently my largest db looks like this:

  Size: 4012 MB
  Nodes: 61552395
  Height: 8
  Input Size: 913 MB
  Encoding: UTF-8
  Documents: 30
  Whitespace Chopping: ON
  Entity Parsing: OFF
  + All indexes.
It's kind of a "pigpen" experimental database - no design at all. I am currently putting together a smaller, properly designed "production" db for online public queries:
  Size: 1067 MB
  Nodes: 36282594
  Documents: 23
I would anticipate the volume of data for researchers growing to 2-4 times that size over the next few years, with less growth on the public database.
- how much time does BaseX need in your context for update operations and the optimize command?
I've barely touched XQuery Update. I load data from files as needed and add/delete documents as required. Optimising the indexes takes a few minutes; the big problem is that online queries stop while this happens. We are transitioning from dev to production, so I have to solve this problem. My plan is to do nightly updates/rebuilds/reindexing on a separate VM (for the big updates), push the BaseXData/db dir across via rsync, and then stop/restart the query BaseX server with the new database in place of the old (see the sketch below). That should not cause any disruption to users. For the small volume of daily updates I will use XQuery Update against a separate small database which won't need indexes. Hence my excitement at being able to integrate these little changes into the results from the larger online query database.
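In case a concrete outline helps, here is a minimal sketch of the nightly swap I have in mind, driven from Python. The host name, database name and data directory paths are assumptions for illustration, not our actual setup, and it assumes passwordless rsync/SSH between the machines:

    #!/usr/bin/env python3
    # Minimal sketch of the nightly "rebuild elsewhere, then swap" idea.
    # Assumes the database has already been rebuilt and optimised on the build VM.
    import subprocess

    BUILD_HOST = "build-vm"          # hypothetical build VM
    DB_NAME = "production"           # hypothetical database name
    REMOTE_DB = BUILD_HOST + ":/srv/basex/data/" + DB_NAME + "/"  # assumed data dir
    LOCAL_DB = "/srv/basex/data/" + DB_NAME + "/"                 # assumed data dir

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Pull the freshly rebuilt and optimised database directory from the build VM.
    run(["rsync", "-a", "--delete", REMOTE_DB, LOCAL_DB])

    # 2. Restart the query server so it serves the new files.
    run(["basexserver", "stop"])       # standard BaseX launch script
    subprocess.Popen(["basexserver"])  # leave the restarted server running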
- have you already encountered performance limits in everyday use,
Performance is very, very good with well-written online queries. Within a few weeks our search interface will go public and you can see for yourself -- I will let you know.
However, I have encountered rather bad problems with long, complex queries that produce a lot of output. With BaseX these jobs would just kind of "hang". As I recall, the issue was with its serialiser -- no output at all until the query completed, so I was running out of RAM. Saxon's serialisation proved much better for this kind of work because I could track the progress of these jobs by tailing the output as they ran. My apologies for not reporting this at the time (2-3 months back); I should have done so. Since then I have noticed some BaseX changes in the serialiser options, but I have not tried again to see whether the problem is solved. It was very unfortunate, because it meant I could not use the fuzzy/ft querying in these large person-name matching jobs, though I do make that facility available to our researchers in online queries.
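For context, this is roughly the kind of fuzzy full-text query I mean, run here through the official BaseX Python client; the database name, element names and the sample spelling variant are made up for illustration:

    from BaseXClient import BaseXClient  # official BaseX Python client module

    # Connect to a running BaseX server (default port and credentials shown;
    # adjust for the real setup).
    session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
    try:
        session.execute("open persons")  # hypothetical database name
        # Fuzzy full-text match on person names, to catch spelling variants
        # such as "Jhonson" vs "Johnson".
        result = session.execute(
            'xquery //person/name[text() contains text "Jhonson" using fuzzy]'
        )
        print(result)
    finally:
        session.close()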
I am struggling with a very strange memory bug right now when loading, but I suspect it could be an issue with my Perl client -- let me report that to you separately from this reply.
or do you rather try to prevent potential bottlenecks in the future?
I like to prevent!
In another real-life scenario that might be similar to yours, BaseX is used as the backend for a library database with 2 million titles (~1 GB of XML data). The process of updating the data and recreating the indexes is applied once a day/night and takes approx. 2 minutes.
This also aligns with my experience. A complete reload/index build takes about 3 minutes for us on an 8 GB VM on a Dell server. I give the JVM ~4 GB via -Xms and -Xmx. Not terribly painful, but as I mention above, I will need to pull a few tricks once we are running a public online search -- a 3-minute outage for updating is unacceptable.
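For what it's worth, one way to pin the heap is to launch the server main class directly with explicit JVM flags rather than editing the basexserver script; the install path below is an assumption:

    # Sketch of starting the BaseX server with an explicit 4 GB heap.
    # -Xms/-Xmx are standard JVM options; the jar location is an assumption.
    import subprocess

    subprocess.Popen([
        "java", "-Xms4g", "-Xmx4g",
        "-cp", "/opt/basex/BaseX.jar",    # assumed install location
        "org.basex.BaseXServer",          # BaseX server main class
    ])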
...and thanks for always giving instructive feedback!
Hope this helps.
Sandra