At one point in the past there were limits to how many unique element names could be stored/indexed in the database. We exceeded that limit for our documents, so to address the problem we started splitting our data into multiple databases and using some hacky rewrites of the QueryContext class to work with them as if they were in one database. We haven't synced up in a while, and the BaseX API and class structure have undergone some really good improvements in the meantime. I'm in the process of revising how we interface with and use BaseX and would like to consider going back to a single database if possible.
In general, the question is: does a limit to the number of unique element/attribute names still exist? If so, what is it?
Time permitting (it appears you guys have been busy pushing out great new features recently), I think a Wiki page with a list of all limits on the database would be very helpful (i.e., limited to X number of elements, limited to Y number of attributes per element, limited to Z size on disk, etc.).
Thanks!
--
Dave Glick | dglick@dracorp.com | 703-299-0700 x212
Data Research and Analysis Corp. | www.dracorp.com
Hi Dave,
thanks for your e-mail. The number of distinct element names is currently limited to 2^15 - 1 (32767). If I remember correctly, the old limit was 256, so I hope that will be enough for your use case (...do you know how many element names are used in your XML instances?)
Christian ___________________________
On Tue, Jul 5, 2011 at 4:29 PM, Dave Glick dglick@dracorp.com wrote:
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Christian,
Thanks for the feedback. Unfortunately, that's the same limit as when I originally implemented our interface and is a little bit too low for our use. Our current use case is to "augment" text content (think something along the lines of source code) with mixed-mode XML for processing. Extending the source code example further, our tool adds XML elements around all relevant content such as method names, variable names, etc. which then lets us query and manipulate the flat textual content. Because the names of these things are not static, we need to support as many element names as the original content requires. It works fine for the first hundred or so input files, but around that point we start running out of unique names. Our current usages consume between 200 and 300 input files, so we'll still hit the 2^15 unique name limit.
The multiple databases approach works, so I'll continue using it. It's just challenging to maintain. The main problem is that the internal BaseX API is really designed around having one database open at a time - or at least it was. In my recollection, while you can obviously query multiple databases using collections and other XQuery functions you've built in, the Context is designed to have one of them appear as "more important" than the others. However, my understanding was never all that good and/or this may have changed since I last looked at it in-depth.
Can you help me understand the current relationship between Context.datas, Context.data, and querying? I see that when I open a new database using cmd.Open, the Data instance is added to the Context.datas collection. However, in other cases such as cmd.CreateDB, the single Context.data reference is modified and the Context.datas collection is not. Querying also appears to place emphasis on the single Context.data reference, especially when constructing the default query context. Even the Wiki documentation on commands is a little unclear about whether BaseX operates on multiple databases simultaneously: for example, the text for SHOW DATABASES reads "Shows all databases that are opened..." while INFO STORAGE reads "...currently opened database".
Hopefully this wasn't too "down in the weeds" for the rest of the list...
Thanks,
Dave
Our current usages consume between 200 and 300 input files, so we'll still hit the 2^15 unique name limit.
Thanks; to get an even better impression: can you estimate how many distinct names you'll get in total? I remember one other use case in which we had to handle around 25,000 names (in most other cases, there are <100 names, which is why we still have that limit).
The multiple databases approach works, so I'll continue using it. It's just challenging to maintain. The main problem is that the internal BaseX API is really designed around having one database open at a time - or at least it was.
Yes, it still is... and it might change at some point in the future, but probably not this year. Instead, we'd rather continue to extend the current limits to support TB-scale database instances (including more generous limits for element names).
In my recollection, while you can obviously query multiple databases using collections and other XQuery functions you've built in, the Context is designed to have one of them appear as "more important" than the others. However, my understanding was never all that good and/or this may have changed since I last looked at it in-depth.
It's pretty good indeed; the main database context was introduced to simplify database operations on the command level, whereas XQuery allows you to access an arbitrary number of databases within the scope of a single query. With BaseX 6.7.1 or 6.8, we'll introduce additional custom XQuery db:...()-functions, which can then be used to perform batch operations on several databases.
Can you help me understand the current relationship between Context.datas, Context.data, and querying?
Context.datas is important in the client/server context: each client is allowed to open its own database, and the Context.datas object remembers how often each database has been referenced (pinned). In the standalone/embedded context, each database will be pinned at most once.
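To make the pinning idea concrete, here is a minimal sketch of a reference-counted registry in plain Java. It is only an illustration of the concept described above, not BaseX's actual Context.datas implementation: the class name PinRegistry and the methods pin, unpin and count are hypothetical and were chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of reference-counted pinning; NOT BaseX's actual code. */
final class PinRegistry {
  /** Database name mapped to the number of open references (pins). */
  private final Map<String, Integer> pins = new HashMap<>();

  /** Registers one more reference to the database and returns the new pin count. */
  synchronized int pin(String db) {
    return pins.merge(db, 1, Integer::sum);
  }

  /** Drops one reference; returns true when none remain and the database could be closed. */
  synchronized boolean unpin(String db) {
    Integer count = pins.get(db);
    if (count == null) throw new IllegalStateException("not pinned: " + db);
    if (count == 1) {
      pins.remove(db);
      return true;
    }
    pins.put(db, count - 1);
    return false;
  }

  /** Returns the current pin count, or 0 if the database is not referenced. */
  synchronized int count(String db) {
    return pins.getOrDefault(db, 0);
  }

  public static void main(String[] args) {
    PinRegistry registry = new PinRegistry();
    registry.pin("sources");                        // client A opens the database
    registry.pin("sources");                        // client B opens the same database
    System.out.println(registry.count("sources"));  // 2
    System.out.println(registry.unpin("sources"));  // false: client B still holds it
    System.out.println(registry.unpin("sources"));  // true: last reference released
  }
}
```

In a standalone/embedded setup the count in this sketch would never exceed 1, which matches the "pinned at most once" remark.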
Hope this helps, Christian
...to get an even better impression: can you estimate how many distinct names you'll get in total?
I can do better than estimate :) After actually measuring, the data set we most often work with currently contains 258 content files with 3,524 unique element names. Obviously, this is an order of magnitude less than 2^15-1, so I must have been remembering the good old days when the limit really was 256 and we blew right past that. Practically, it also means that with the new unique element name limit I should be fine re-engineering to a single-database approach (yay!).
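For anyone who wants to run a similar measurement on their own files, the following is a rough sketch that counts distinct element names with the JDK's StAX parser and compares the total against 2^15 - 1. It is not the tool used for the numbers above: the class DistinctElementNames, the collect method, and the inline sample documents are made up for illustration, and counting by qualified name is an assumption that may not match exactly how BaseX counts names internally.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not part of BaseX): collects distinct element names via StAX. */
final class DistinctElementNames {
  /** Limit quoted in this thread: 2^15 - 1 = 32767. */
  static final int NAME_LIMIT = (1 << 15) - 1;

  /** Adds the name of every element in the given XML string to the set. */
  static void collect(String xml, Set<String> names) throws XMLStreamException {
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    try {
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
          // QName.toString() includes the namespace URI when one is present (assumption:
          // names are compared as qualified names)
          names.add(reader.getName().toString());
        }
      }
    } finally {
      reader.close();
    }
  }

  public static void main(String[] args) throws XMLStreamException {
    Set<String> names = new TreeSet<>();
    // Two tiny stand-ins for augmented content files
    collect("<file><method>run</method> calls <method>stop</method></file>", names);
    collect("<file><variable>x</variable></file>", names);
    System.out.println(names.size() + " distinct names, limit " + NAME_LIMIT);
    System.out.println(names.size() <= NAME_LIMIT ? "fits in one database" : "exceeds limit");
  }
}
```

Sharing one set across all input files is what matters here, since the limit applies to the database as a whole rather than to each document.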
In general I agree with the direction of focusing on robust support for a single database. Ours was (and obviously no longer is) an edge case. As long as multiple documents can be stored in a single database and the limiting factors of that database exceed reasonable thresholds, I really don't see many use cases like ours in the future where multiple simultaneous database accesses would be required.
Thanks also for the insight into Context Data references - I've never really had that straight in my mind, but it makes sense now given the possibility to operate BaseX as a server with the potential for lots of different clients with lots of simultaneous database connections to them. That's totally outside our own use case though, so I'll just continue to ignore the multiple Data reference collection for now.
As always, thanks for the stellar feedback and support. I'm excited to integrate all the new stuff you've been adding in the last several months.
Dave