Hi all,
We've been using BaseX for several years now and have constantly been skirting around our primary use case: using BaseX in an embedded mode. By this I mean using BaseX in-process in an application, without any kind of client/server communication bridge and with very direct access to BaseX primitives. There are several reasons for wanting to do this, including performance (which seems to be the subject of recent discussions, i.e., running the server in "local" mode). My own primary reason is to gain more direct access to the database objects. For example, we routinely need to:
- Directly access and traverse database nodes by climbing, descending, following, etc.
- Insert or remove content at a specific database node
- Store references to individual nodes (i.e., using its "pre" and "index" value)
- Fine-tune queries in order to set context, external functions, etc.
While many of these operations can indeed be performed through the existing client/server interface, it's less friendly - especially when doing things like asking for the next sibling of a given node. With a direct embedded API you just get the next node, bypassing the XQuery processor altogether. From my current work in this area, I think BaseX is already "primed" for this kind of API - 90% or more of the code is already in place since most of the primitives already expose common methods for use by database commands, XQuery processor, etc. All that should be needed is to expose this functionality in a stable and complete API.
Good examples of applications that may need this kind of API include media players (i.e., for storage of the media library data), simple stand-alone database applications, etc. Until recently, we've been able to adapt BaseX to fit our needs by writing a thin wrapper layer that interfaces with the appropriate BaseX classes. However, with the rapid pace of BaseX development these days, it's becoming increasingly difficult to track each release, since we rely on aspects of the BaseX codebase that are not really intended for public consumption and thus keep changing. This brings up a couple of questions:
- Are we the only ones interested in a direct embedded interface?
- Does the BaseX team have any plans to implement such an interface?
- Would such an interface be better implemented by the BaseX team (as opposed to a third party)?
I don't mind doing some work in this area; however, I have some concerns about doing so. Primarily, given that the whole idea would be to make direct integration easier and more stable, it seems like the structure and layout of the classes in the embedded API, and the ways that they interact with the underlying BaseX objects, should probably be determined by the BaseX team. The danger is that someone outside the team spends effort creating such an interface, only to do things in a way that's either not preferred or difficult to maintain as the core team continues to improve the overall product.
Hopefully this was clear... Thoughts?
Dave
Hi Dave, hi all,
Better Java APIs for BaseX: yes, that's a very relevant topic nowadays, and something we've been discussing frequently in our team over the last few weeks. The main challenge we are struggling with is that there are just too many ways such an API could look, and too many incoming requests that can hardly be bundled into one single API. Here are some of the requirements we're dealing with, and the approaches that could be pursued (and I already know which of them you would prefer ;) :
* A new Command and Query/Result API could enhance or replace the existing light-weight client Java API, and the representation of results would be separated from the low-level data structures in BaseX. This API could be used in the client/server architecture as well, but it would introduce some overhead, as all the data structures would have to be replicated by the client.
* The new Command, Query and Result objects could also be made serializable. This way, they could easily be transferred over the network, and there would be no need to develop custom binary protocols.
* A real embedded API could ensure that developers do not suffer from frequent changes in our query and storage backend. Instead, we would ensure that the API does not change as long as the major version is not updated. This API would be much more efficient than a client/server API, but we might have to put more work into transactional issues.
* The existing XML:DB and XQJ APIs could be revised and updated to support the client/server architecture. This could reduce the need for any other client/server-based API with richer functionality.
Everyone who is interested in more powerful APIs: please speak up! The more feedback we get, the better we'll be able to design our APIs. And of course we're interested in volunteers out there. Last but not least, this is an Open Source and community project ;)
@Dave: I've recently added a minimal query API for the QT3TS, Michael Kay's new W3C XQuery Test Suite. Both the test suite driver and the mini API (qt3api) are still work in progress:
https://github.com/BaseXdb/basex-tests/tree/master/src/main/java/org/basex/t...
It is not low-level enough to directly support any axis or update operations; instead, new QueryProcessor instances are created to perform queries on intermediate nodes. It would be great if you could have a look at this API, and it would then be interesting to know more about your performance requirements: do you think that the overhead for parsing and compiling query expressions (which usually does not take longer than a few microseconds, and is often faster than the actual axis traversals) will be too expensive in your scenario?
If you believe that this framework would be sufficient, we could start to enhance it, make it safe for concurrent access, document it, etc. If you need to work with the PRE and ID values of database nodes, for example, you could take advantage of the db: functions of BaseX [1]:
Output: db:node-id($node)
Input: db:open-id($db, $id)
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Database_Functions
On Mon, Nov 14, 2011 at 6:26 PM, Dave Glick dglick@dracorp.com wrote:
[...]
Dave,
I have been working on an Android port of BaseX. My Android BaseX service bypasses all of the client/server code which is included in BaseX. I am only working with the context, core objects and commands.
I believe it's already very easy to make an "embedded" version of BaseX if you look at BaseX more as a framework rather than just a database/server. Granted, the APIs you are coding to are basically "internal" BaseX APIs, but if you are looking for low-level, you need to get right into the heart of BaseX.
If you want an example of how this is done, look no further than the BaseX GUI "client", which is not really a client but a GUI app that integrates the core of BaseX into the application. At least, that is how I see it.
Regards, Eric
On Mon, Nov 14, 2011 at 7:55 PM, Christian Grün christian.gruen@gmail.com wrote:
[...]
Christian,
Thank you, as always, for your thoughtful and thorough response. I'm excited to hear that this is already being discussed around BaseX world headquarters. Several of your possible approaches raise interesting questions or capabilities that I hadn't considered, especially given my somewhat myopic view of the situation based on our own requirements. As you guessed, my own interest is in directly exposing the underlying data structures in a more stable and cohesive way. I've looked in-depth at the existing APIs (DOM, XML:DB, and XQJ) and while the development idealist in me wants to use and champion standard protocols, the pragmatist that needs to get a job done accepts that such standards tend to support the lowest common denominator and thus fail to capture more complex or specific use cases (though I still applaud you guys for providing and supporting such a broad range of APIs, which are certainly valuable in many situations). While the existing APIs do provide a lot of functionality, they're not consistent (use DOM for traversing nodes, XQJ for querying, etc.) and there are gaps in functionality (creation and manipulation of databases, direct updating, etc.). In any case, the existence of these other APIs has provided a lot of sample code that we've adapted in our own embedded API.
@ Eric - you mentioned that you're working on an Android port and have taken the viewpoint that BaseX is a framework on which other tools can be built. That's very much in line with how we've been treating it. For a little background, we've been using BaseX to support a variety of projects, the largest of which is a generic RCP for the manipulation of modeling and simulation data (think Eclipse for M&S). We take inputs to models and simulations, parse them up into XML, and then store, query, and manipulate them in BaseX with lots of frontend visualizations and other graphical interfaces. The challenging part of our work is that we use .NET and Mono, so we cross-compile BaseX using IKVM and then wrap it to make it easier to work with in .NET. Since all of our work uses BaseX as an embedded database, we've found that we're basically developing something akin to an embedded API in our wrapper. Rather than create such an API inside the .NET wrapper, I'd much prefer the API existed in BaseX itself and the wrapper could be "thinner". The biggest problem we have is keeping up with changes to the classes in BaseX. It's being released and improved on such a rapid cycle (great job!) that we're constantly reworking portions of our API to match, and frequently get multiple releases behind.
Perhaps it would help if I shared some of our own use cases. Obviously we're not the only potential users of an embedded API, but some of the issues we've had using it in that mode may help refine requirements for an eventual implementation.
* Node referencing and persistence
We interact with the database a lot. The general pattern is that a given visualization or view in our tool will execute a query to return a result set and will then store references to the resultant nodes. The application provides a multiple document interface and each view on a given document can potentially manipulate the database. This is why we store references to each node. Some of the queries are rather time consuming and rather than re-evaluate them each time the data is changed in other views, we simply persist the node references through changes.
This was one of the harder areas to get right in our API. Because BaseX is primarily client/server driven and focused on returning serialized results, the notion of node references and persistence just isn't there. Internally, BaseX stores a pre value inside its node objects. This is fine until the database changes and something gets inserted or deleted and the pre values change. At that point, the previous pre value no longer refers to the same node and any objects that rely on it are invalidated. This works great for all the current use cases because there's no need for persistence in the node references following database updates. However, in an embedded environment, there are lots of reasons why node references should be persisted. To solve this we used the following algorithm:
- We initialize node objects with a Data and pre value, just like the internal DBNode class.
- We immediately fetch the associated (and immutable) index for the pre on construction.
- We also store the timestamp for the most recent database modification.
- Every time an operation is performed on a node object, it first checks to see if the saved timestamp is different than the database one, and if so it updates the pre value by using the stored index value and the new timestamp is saved.
- If the index no longer exists (the node was removed) the object is invalidated.
This ensures that a reference to a node object continues to reference the "same" node no matter what happens to the database.
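A minimal sketch of the re-resolution scheme described above, written against a toy in-memory stand-in rather than BaseX itself. `ToyData`, `NodeRef` and all method names are hypothetical; a modification counter stands in for the database timestamp, and the linear lookup from index to pre is only there to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a database: immutable node indexes kept in document (pre) order. */
final class ToyData {
  private final List<Integer> indexes = new ArrayList<>(); // position = pre, element = index
  private int nextIndex;   // indexes are never reused
  private long modified;   // stands in for the "last modified" timestamp

  /** Inserts a new node at the given pre position and returns its immutable index. */
  int insert(int pre) { indexes.add(pre, nextIndex); modified++; return nextIndex++; }

  /** Deletes the node at the given pre position. */
  void delete(int pre) { indexes.remove(pre); modified++; }

  /** Returns the immutable index of the node currently at the given pre position. */
  int index(int pre) { return indexes.get(pre); }

  /** Returns the current pre value for an index, or -1 if the node is gone (linear scan). */
  int pre(int index) { return indexes.indexOf(index); }

  long modified() { return modified; }
}

/** A node reference that keeps pointing at the "same" node across updates. */
final class NodeRef {
  private final ToyData data;
  private final int index; // fetched once, on construction
  private int pre;         // cached pre value; may be stale after an update
  private long seen;       // timestamp at which the cached pre value was valid

  NodeRef(ToyData data, int pre) {
    this.data = data;
    this.pre = pre;
    this.index = data.index(pre);
    this.seen = data.modified();
  }

  /** Returns the current pre value, re-resolving it if the database changed; -1 if removed. */
  int pre() {
    if (seen != data.modified()) {
      pre = data.pre(index);
      seen = data.modified();
    }
    return pre;
  }

  /** A reference is invalid once its index no longer exists. */
  boolean valid() { return pre() != -1; }
}

final class NodeRefDemo {
  public static void main(String[] args) {
    ToyData data = new ToyData();
    data.insert(0); data.insert(1); data.insert(2);
    NodeRef ref = new NodeRef(data, 2);
    data.insert(0);                   // shifts all existing nodes by one
    System.out.println(ref.pre());    // prints 3
    data.delete(3);                   // removes the referenced node
    System.out.println(ref.valid());  // prints false
  }
}
```

The timestamp check keeps the common case (no update since the last access) down to a single comparison; only after an update does the reference pay for the index lookup.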
* Node traversal
We traverse nodes in our application. A lot. There are many cases where an operation needs to be performed dynamically and it's much easier to manipulate the database directly than attempt to construct an XQuery that captures the desired behavior and execute that. Not to mention that we've realized significant performance gains by bypassing the query engine and relying on direct manipulation for operations that perform large numbers of traversal and modification operations (sometimes in the 10,000 or 100,000 node order of magnitude). In fact, this is already captured in the current DOM API implementation. I would just suggest that any future embedded API also make sure to provide direct and efficient node traversal operations.
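To illustrate why direct traversal can be cheap, here is a sketch of child and sibling navigation over a flat table in document (pre) order. The layout (one `parent` and one `size` entry per node, no attributes) is a simplifying assumption for illustration and not BaseX's actual storage format; `PreTable` and its methods are hypothetical names.

```java
/** Simplified pre-order node table (hypothetical; not the actual BaseX storage layout). */
final class PreTable {
  private final String[] name; // node name at each pre value
  private final int[] parent;  // pre value of the parent, -1 for the root
  private final int[] size;    // number of nodes in the subtree, including the node itself

  PreTable(String[] name, int[] parent, int[] size) {
    this.name = name;
    this.parent = parent;
    this.size = size;
  }

  String name(int pre) { return name[pre]; }

  /** Climbing: the parent is stored directly. */
  int parent(int pre) { return parent[pre]; }

  /** Descending: the first child is the next table entry, if the subtree has more than one node. */
  int firstChild(int pre) { return size[pre] > 1 ? pre + 1 : -1; }

  /** Following: the next sibling starts right after this subtree, if still inside the parent's. */
  int nextSibling(int pre) {
    int par = parent[pre];
    if (par == -1) return -1;
    int next = pre + size[pre];
    return next < par + size[par] ? next : -1;
  }
}

final class PreTableDemo {
  public static void main(String[] args) {
    // <library><album><track/><track/></album><album/></library>
    PreTable table = new PreTable(
        new String[] {"library", "album", "track", "track", "album"},
        new int[] {-1, 0, 1, 1, 0},
        new int[] {5, 3, 1, 1, 1});
    // Walk the children of the root without evaluating any query.
    for (int pre = table.firstChild(0); pre != -1; pre = table.nextSibling(pre)) {
      System.out.println(pre + " " + table.name(pre)); // prints "1 album", then "4 album"
    }
  }
}
```

Each step is a couple of array reads and an addition, which is the kind of operation that adds up favourably over tens of thousands of traversal steps.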
* Direct updating
If you can store a reference to nodes, and you can traverse them to get to other nodes, it stands to reason that you should be able to change them as well. This is another area where we've spent significant effort creating some capability that didn't already exist. In our own node class we provide operations for setting inner text (or atomic value if you prefer), inner XML, and outer XML (though that last one is kind of a hack since we just remove the node in question and insert new content as a child of the parent at the appropriate location). Again, this is an area where we noticed performance gains over using XQuery update when appropriate. That, and we implemented a lot of the logic before XQuery update was even supported in BaseX (though we've had to change it significantly to keep up). This seems like another area where a complete embedded API should provide some capability.
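The outer-XML replacement described above (remove the node, then insert the new content under the parent at the same position) can be sketched on a toy tree. `ToyNode` and `replaceWith` are hypothetical names, and the example ignores everything a real database update involves (shifting table entries, index maintenance, locking).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory tree node, used only to illustrate the replace-in-place idea. */
final class ToyNode {
  final String name;
  ToyNode parent;
  final List<ToyNode> children = new ArrayList<>();

  ToyNode(String name) { this.name = name; }

  /** Appends a child and returns this node, so that calls can be chained. */
  ToyNode add(ToyNode child) {
    child.parent = this;
    children.add(child);
    return this;
  }

  /** Replaces this node by removing it and inserting the replacement at the same child position. */
  void replaceWith(ToyNode replacement) {
    if (parent == null) throw new IllegalStateException("cannot replace the root");
    int pos = parent.children.indexOf(this); // position among the siblings
    parent.children.remove(pos);             // remove the old node ...
    replacement.parent = parent;
    parent.children.add(pos, replacement);   // ... and insert the new one in its place
    parent = null;                           // the old node is now detached
  }
}

final class ToyNodeDemo {
  public static void main(String[] args) {
    ToyNode root = new ToyNode("library");
    ToyNode old = new ToyNode("album");
    root.add(new ToyNode("artist")).add(old).add(new ToyNode("genre"));
    old.replaceWith(new ToyNode("playlist"));
    for (ToyNode child : root.children) {
      System.out.println(child.name); // prints artist, playlist, genre
    }
  }
}
```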
* Query manipulation
I think XQJ is on the right track here. Ideally, the notion of a query should be captured in an object and exposed so that the context can be manipulated, collections can be created, variables can be created/initialized, etc. I'd just like to see the eventual embedded API include this functionality directly integrated with the other parts of the embedded API.
I would be happy to download and start looking at QT3TS - I respect Michael's work on Saxon and am interested to see what his test suite looks like. I'd also be happy to run some benchmarking and performance comparisons on direct manipulation vs. XQuery evaluation for various operations in order to get some hard numbers. Anecdotally, my general experience has been that directly manipulating the database through primitives is much more efficient for large numbers of operations or situations where the logic is very dynamic or complex and an XQuery statement would be long or difficult to derive.
This email ended up being a bit longer than I had intended - but that always seems to happen with me :). I hope some of my stream-of-consciousness was valuable. Let's keep up this discussion... I'll write more as I think of it.
Dave
-----Original Message-----
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Monday, November 14, 2011 7:56 PM
To: Dave Glick
Cc: BaseX-Talk@mailman.uni-konstanz.de; Eric Murphy; Charles Foster
Subject: Re: [basex-talk] Embedded API
[...]
Hello Dave,
once more, thanks a lot for your detailed wrap-up of how you are working with our code base and what the main challenges are. I also got some positive responses to your mail offline. It seems that most of our subscribers avoid too much publicity, which is why most replies to my API call went directly to my mail address (and which is one of the reasons why my answer is rather delayed this time). So everyone: please don't be too shy ;) The more questions can be discussed online, the better!
I agree with pretty much everything you wrote in your e-mail. The existing standard APIs (XML:DB, XQJ and others) are way too limited to make BaseX flourish, and it is for good reason that we also chose a tight coupling of the database and visualization operations in our GUI. Our visualizations would get much slower if we decided to use XQuery for all the database communication. It would be an interesting exercise, however, as our current approach does not accept any concurrent update operations, and many reported bugs are due to that problem (client/server-based updates are not reflected in the GUI at all). In the (not so near) future, we want to precompile our XQuery expressions, and the resulting architecture could be very efficient for such use cases. Just in case you don't know (but I guess you know more about BaseX than most other developers out there ;): the db:open-pre() and db:node-pre() functions can be used to directly access nodes in the database. Once those queries are precompiled, the overhead for evaluating a query will decrease further.
At the same time, it's a real challenge to specify a neat and yet complete superset of all operations that may be needed - and the more e-mails I get and conversations I have, the less I know how it could look like.. Which means that we'll probably have to offer at least two (if not three) APIs to meet all the requirements I encountered so far.
It was interesting to read how you are dealing with updated database nodes in your current architecture. In the middle-term, we might introduce the already working ID-PRE mapping, which might speed up many of your remapping operations. If our mapping algorithms turn out to be efficient enough, an upcoming low-level API might then use the the ID value and the database name as unique reference.
Regarding the release of your .NET API: yes, it would be great to have it publicly available. Maybe you could start off with a restrictive license and make it more liberal as time progresses?
Some quick answers to the questions you asked in another mail:
How can a NodeCache be constructed from a MemData?
This one might help:
MemData data = ...
int pre = ...
NodeCache nc = new NodeCache();
nc.add(new DBNode(data, pre));
It appears as though InputInfo is mainly used for error reporting - does it have other uses that I need to watch out for (I'm thinking I'll just initialize it to default values)?
Yes, InputInfo is only needed for error feedback; you may safely set it to "null" if it wouldn't include any useful information anyway.
Does this whole strategy sound reasonable or is there a better way to conduct update operations on the database given an input stream?
I guess I'll address this one in a second mail..
Christian
On Tue, Nov 15, 2011 at 5:27 PM, Dave Glick dglick@dracorp.com wrote:
Christian,
Thank you, as always, for your thoughtful and thorough response. I'm excited to hear that this is already being discussed around BaseX world headquarters. Several of your possible approaches raise interesting questions or capabilities that I hadn't considered, especially given my somewhat myopic view of the situation based on our own requirements. As you guessed, my own interest is in directly exposing the underlying data structures in a more stable and cohesive way. I've looked in-depth at the existing APIs (DOM, XML:DB, and XQJ) and while the development idealist in me wants to use and champion standard protocols, the pragmatist that needs to get a job done accepts that such standards tend to support the lowest common denominator and thus fail to capture more complex or specific use cases (though I still applaud you guys for providing and supporting such a broad range of APIs, which are certainly valuable in many situations). While the existing APIs do provide a lot of functionality, they're not consistent (use DOM for traversing nodes, XQJ for querying, etc.) and there are gaps in functionality (creation and manipulation of databases, direct updating, etc.). In any case, the existence of these other APIs has provided a lot of sample code that we've adapted in our own embedded API.
@ Eric - you mentioned that you're working on an Android port and have taken the viewpoint that BaseX is a framework on which other tools can be built. That's very much in line with how we've been treating it. For a little background, we've been using BaseX to support a variety of projects, the largest of which is a generic RCP for the manipulation of modeling and simulation data (think Eclipse for M&S). We take inputs to models and simulations, parse them up into XML, and then store, query, and manipulate them in BaseX with lots of frontend visualizations and other graphical interfaces. The challenging part of our work is that we use .NET and Mono, so we cross-compile BaseX using IKVM and then wrap it to make it easier to work with in .NET. Since all of our work uses BaseX as an embedded database, we've found that we're basically developing something akin to an embedded API in our wrapper. Rather than create such an API inside the .NET wrapper, I'd much prefer the API existed in BaseX itself and the wrapper could be "thinner". The biggest problem we have is keeping up with changes to the classes in BaseX. It's being released and improved on such a rapid cycle (great job!) that we're constantly reworking portions of our API to match, and frequently get multiple releases behind.
Perhaps it would help if I shared some of our own use cases. Obviously we're not the only potential users of an embedded API, but some of the issues we've had using it in that mode may help refine requirements for an eventual implementation.
- Node referencing and persistence
We interact with the database a lot. The general pattern is that a given visualization or view in our tool will execute a query to return a result set and will then store references to the resultant nodes. The application provides a multiple document interface and each view on a given document can potentially manipulate the database. This is why we store references to each node. Some of the queries are rather time consuming and rather than re-evaluate them each time the data is changed in other views, we simply persist the node references through changes.
This was one of the harder areas to get right in our API. Because BaseX is primarily client/server driven and focused on returning serialized results, the notion of node references and persistence just isn't there. Internally, BaseX stores a pre value inside its node objects. This is fine until the database changes and something gets inserted or deleted and the pre values change. At that point, the previous pre value no longer refers to the same node and any objects that rely on it are invalidated. This works great for all the current use cases because there's no need for persistence in the node references following database updates. However, in an embedded environment, there are lots of reasons why node references should be persisted. To solve this we used the following algorithm:
- We initialize node objects with a Data and pre value, just like the internal DBNode class.
- We immediately fetch the associated (and immutable) index for the pre on construction.
- We also store the timestamp for the most recent database modification.
- Every time an operation is performed on a node object, it first checks whether the saved timestamp differs from the database's. If so, it updates the pre value using the stored index value and saves the new timestamp.
- If the index no longer exists (the node was removed) the object is invalidated.
This ensures that a reference to a node object continues to reference the "same" node no matter what happens to the database.
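The steps above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `ToyData` and `NodeRef` are hypothetical names, the "database" is just a list whose positions play the role of pre values, and the immutable index is called `id` here. It is not the BaseX API or the actual wrapper code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a database (assumption): list position = pre value, element = immutable node ID. */
final class ToyData {
    private final List<Integer> idsInDocumentOrder = new ArrayList<>();
    private long modified = 0; // bumped on every structural change, acts as the timestamp
    private int nextId = 0;

    /** Inserts a new node at the given pre position and returns its immutable ID. */
    int insert(int pre) {
        int id = nextId++;
        idsInDocumentOrder.add(pre, id);
        modified++;
        return id;
    }

    /** Deletes the node at the given pre position. */
    void delete(int pre) {
        idsInDocumentOrder.remove(pre);
        modified++;
    }

    int id(int pre) { return idsInDocumentOrder.get(pre); }

    /** Returns the current pre value for an ID, or -1 if the node is gone (linear scan). */
    int pre(int id) { return idsInDocumentOrder.indexOf(id); }

    long modified() { return modified; }
}

/** Node reference that survives updates by remapping its pre value through the stored ID. */
final class NodeRef {
    private final ToyData data;
    private final int id;     // fetched once on construction, never changes
    private int pre;          // may become stale after updates
    private long stamp;       // modification stamp seen at the last remapping
    private boolean valid = true;

    NodeRef(ToyData data, int pre) {
        this.data = data;
        this.pre = pre;
        this.id = data.id(pre);
        this.stamp = data.modified();
    }

    /** Returns the up-to-date pre value, or -1 once the node has been removed. */
    int pre() {
        if (valid && stamp != data.modified()) {
            pre = data.pre(id);
            stamp = data.modified();
            if (pre < 0) valid = false; // ID no longer exists: invalidate the reference
        }
        return valid ? pre : -1;
    }

    boolean isValid() {
        pre(); // refresh first
        return valid;
    }
}

final class NodeRefDemo {
    public static void main(String[] args) {
        ToyData data = new ToyData();
        data.insert(0);
        data.insert(1);
        data.insert(2);
        NodeRef ref = new NodeRef(data, 2);
        data.insert(0);                      // shifts every existing node by one
        System.out.println(ref.pre());       // 3
        data.delete(3);                      // removes the referenced node
        System.out.println(ref.isValid());   // false
    }
}
```

The linear `indexOf` lookup is the expensive part of this sketch; a real ID-to-pre mapping would need a faster structure.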
- Node traversal
We traverse nodes in our application. A lot. There are many cases where an operation needs to be performed dynamically and it's much easier to manipulate the database directly than attempt to construct an XQuery that captures the desired behavior and execute that. Not to mention that we've realized significant performance gains by bypassing the query engine and relying on direct manipulation for operations that perform large numbers of traversal and modification operations (sometimes on the order of 10,000 or 100,000 nodes). In fact, this is already captured in the current DOM API implementation. I would just suggest that any future embedded API also make sure to provide direct and efficient node traversal operations.
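To make "direct traversal" concrete, here is a minimal sketch of sibling and child navigation over a pre-ordered node table. The layout (one parent reference and one subtree size per pre value) and the `ToyTable` class are assumptions made for illustration; the real BaseX storage classes are not used here and may differ.

```java
/** Toy pre-order node table (assumption): row index = pre value. */
final class ToyTable {
    private final int[] parent; // pre value of the parent, -1 for the root
    private final int[] size;   // number of nodes in the subtree, including the node itself

    ToyTable(int[] parent, int[] size) {
        this.parent = parent;
        this.size = size;
    }

    /** Returns the pre value of the parent, or -1 for the root. */
    int parent(int pre) { return parent[pre]; }

    /** Returns the first child, or -1 if the node has no descendants. */
    int firstChild(int pre) { return size[pre] > 1 ? pre + 1 : -1; }

    /** Returns the next sibling, or -1: skip the own subtree, then check the parent's bounds. */
    int nextSibling(int pre) {
        int par = parent[pre];
        if (par < 0) return -1;
        int next = pre + size[pre];
        return next < par + size[par] ? next : -1;
    }
}

final class ToyTableDemo {
    public static void main(String[] args) {
        // Tree: 0 is the root with children 1, 3 and 4; node 2 is the only child of node 1.
        ToyTable table = new ToyTable(
            new int[] {-1, 0, 1, 0, 0},
            new int[] { 5, 2, 1, 1, 1});
        // Walks the children of the root without any query evaluation: prints 1, 3, 4.
        for (int c = table.firstChild(0); c != -1; c = table.nextSibling(c)) {
            System.out.println(c);
        }
    }
}
```

Each step is a couple of array reads, which is the kind of cost that gets lost when every navigation step goes through a query processor.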
- Direct updating
If you can store a reference to nodes, and you can traverse them to get to other nodes, it stands to reason that you should be able to change them as well. This is another area where we've spent significant effort creating some capability that didn't already exist. In our own node class we provide operations for setting inner text (or atomic value if you prefer), inner XML, and outer XML (though that last one is kind of a hack since we just remove the node in question and insert new content as a child of the parent at the appropriate location). Again, this is an area we noticed performance gains over using XQuery update when appropriate. That and we implemented a lot of the logic before XQuery update was even supported in BaseX (though we've had to change it significantly to keep up). This seems like another area where a complete embedded API should provide some capability.
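The inner-text and outer-XML operations described above might look roughly like this. The sketch uses the JDK's built-in DOM as a stand-in, since it ships with Java; it is not BaseX code and not the .NET wrapper, and `DirectUpdateSketch` and its method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DirectUpdateSketch {
    /** Parses an XML string into a DOM document; failures are rethrown unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xml, e);
        }
    }

    /** Sets the inner text of a node, replacing all of its children. */
    static void setInnerText(Node node, String text) {
        node.setTextContent(text);
    }

    /**
     * Sets the "outer XML" the way the mail describes it: remove the node,
     * then insert the new content into the parent at the same position.
     * Returns the inserted node.
     */
    static Node setOuterXml(Node node, String xml) {
        Document owner = node.getOwnerDocument();
        // The parsed fragment belongs to another document and must be imported first.
        Node replacement = owner.importNode(parse(xml).getDocumentElement(), true);
        Node parent = node.getParentNode();
        Node next = node.getNextSibling(); // remember the position before removing
        parent.removeChild(node);
        parent.insertBefore(replacement, next); // a null reference node appends at the end
        return replacement;
    }

    public static void main(String[] args) {
        Document doc = parse("<lib><track>old</track><track>keep</track></lib>");
        Node first = doc.getDocumentElement().getFirstChild();
        setInnerText(first, "new");
        System.out.println(first.getTextContent());                                 // new
        setOuterXml(first, "<album>A</album>");
        System.out.println(doc.getDocumentElement().getFirstChild().getNodeName()); // album
    }
}
```

Note that after `setOuterXml` the old node object is detached, so any stored reference to it would have to be replaced by the returned node.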
- Query manipulation
I think XQJ is on the right track here. Ideally, the notion of a query should be captured in an object and exposed so that the context can be manipulated, collections can be created, variables can be created/initialized, etc. I'd just like to see the eventual embedded API include this functionality directly integrated with the other parts of the embedded API.
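As a rough picture of a query captured in an object, the following sketch compiles an expression once and lets the caller bind variables and choose the context node per run. It uses the JDK's `javax.xml.xpath` package as a stand-in for a real XQuery processor; `QuerySketch` and its methods are hypothetical and are not part of XQJ or BaseX.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical query object: compiled once, variables and context supplied per run. */
final class QuerySketch {
    private final Map<QName, Object> variables = new HashMap<>();
    private final XPathExpression expression;

    QuerySketch(String query) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Variables are looked up in the map when the expression is evaluated.
        xpath.setXPathVariableResolver(variables::get);
        try {
            expression = xpath.compile(query);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid query: " + query, e);
        }
    }

    /** Binds (or rebinds) a variable; returns this object to allow chaining. */
    QuerySketch bind(String name, Object value) {
        variables.put(new QName(name), value);
        return this;
    }

    /** Evaluates the query with the given node as context. */
    NodeList run(Node context) {
        try {
            return (NodeList) expression.evaluate(context, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("query failed", e);
        }
    }

    /** Helper for the demo: parses an XML string, rethrowing failures unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xml, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<lib><track genre='jazz'>A</track>"
            + "<track genre='rock'>B</track><track genre='jazz'>C</track></lib>");
        QuerySketch query = new QuerySketch("track[@genre = $genre]");
        Node context = doc.getDocumentElement(); // the context is set by the caller
        System.out.println(query.bind("genre", "jazz").run(context).getLength());
        System.out.println(query.bind("genre", "rock").run(context).getLength());
    }
}
```

The result is a list of live nodes rather than serialized text, which is what would let it plug into node references and direct updates in an integrated embedded API.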
I would be happy to download and start looking at QT3TS - I respect Michael's work on Saxon and am interested to see what his test suite looks like. I'd also be happy to run some benchmarking and performance comparisons on direct manipulation vs. XQuery evaluation for various operations in order to get some hard numbers. Anecdotally, my general experience has been that directly manipulating the database through primitives is much more efficient for large numbers of operations or situations where the logic is very dynamic or complex and an XQuery statement would be long or difficult to derive.
This email ended up being a bit longer than I had intended - but that always seems to happen with me :). I hope some of my stream-of-consciousness was valuable. Let's keep up this discussion... I'll write more as I think of it.
Dave
-----Original Message-----
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Monday, November 14, 2011 7:56 PM
To: Dave Glick
Cc: BaseX-Talk@mailman.uni-konstanz.de; Eric Murphy; Charles Foster
Subject: Re: [basex-talk] Embedded API
Hi Dave, hi all,
better Java APIs for BaseX - yes, that's a very relevant topic nowadays, something that we've frequently been discussing in our team over the last weeks. The main challenge we are struggling with is that there are just too many ways such an API could look, and too many incoming requests that can hardly be bundled into one single API. Here are some of the requirements we're dealing with, and the approaches that could be pursued (and I already know which of them you would prefer ;) :
- a new Command and Query/Result API could enhance/replace the existing light-weight client Java API, and the representation of results would be separated from the low-level data structures in BaseX. This API could be used in the client/server architecture as well, but it would introduce some overhead, as all the data structures would have to be replicated by the client. The new Command, Query and Result objects could also be made serializable. This way, they could be easily transferred over the network, and there would be no need to develop custom binary protocols.
- a real embedded API could ensure that developers do not suffer from frequent changes in our query and storage backend. Instead, we would ensure that the API does not change as long as the major version is not updated. This API would be much more efficient than a client/server API, but we might have to put more work into transactional issues.
- the existing XML:DB and XQJ APIs could be revised and updated to support the client/server architecture. This could reduce the need for any other client/server-based API with a richer functionality.
Everyone who is interested in more powerful APIs.. Please speak out! The more feedback we get, the better we'll be able to design our APIs. And of course we're interested in volunteers out there... Last but not least, this is an Open Source and community project ;)
@Dave: I've recently added a minimal query API for the QT3TS, Michael Kay's new W3C XQuery Test Suite. Both the test suite driver and the mini API (qt3api) are still work in progress:
https://github.com/BaseXdb/basex-tests/tree/master/src/main/java/org/basex/t...
It is not low-level enough to directly support any axis or update operations; instead, new QueryProcessor instances are created to perform queries on intermediate nodes. It would be great if you could have a look at this API, and it would then be interesting to know more about your performance requirements: do you think that the overhead for parsing and compiling query expressions (which usually does not take longer than some microseconds, and is often faster than the actual axis traversals) will be too expensive in your scenario?
If you believe that this framework would be sufficient, we could start to enhance it, make it safe for concurrent access, document it, etc. If you need to work with the PRE and ID values of database nodes, e.g., you could take advantage of the db: functions of BaseX [1]:
Output: db:node-id($node)
Input: db:open-id($db, $id)
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Database_Functions
On Mon, Nov 14, 2011 at 6:26 PM, Dave Glick dglick@dracorp.com wrote:
Hi all,
We've been using BaseX for several years now and have constantly been skirting around our primary use case: using BaseX in an embedded mode. What I mean by this is using BaseX in-process in an application without running any kind of client/server communication bridge and with very direct access to BaseX primitives. There are several reasons for wanting to do this including performance (which seems to be the subject of recent discussions, I.e., running the server in "local" mode). My own primary reason is to gain more direct access to the database objects. For example, we routinely have a need to:
- Directly access and traverse database nodes by climbing, descending, following, etc.
- Insert or remove content at a specific database node
- Store references to individual nodes (I.e., using its "pre" and "index" value)
- Fine-tune queries in order to set context, external functions, etc.
While many of these operations can indeed be performed through the existing client/server interface, it's less friendly - especially when doing things like asking for the next sibling of a given node. With a direct embedded API you just get the next node, bypassing the XQuery processor altogether. From my current work in this area, I think BaseX is already "primed" for this kind of API - 90% or more of the code is already in place since most of the primitives already expose common methods for use by database commands, XQuery processor, etc. All that should be needed is to expose this functionality in a stable and complete API.
Good examples of applications that may need this kind of API include media players (I.e., for storage of the media library data), simple stand-alone database applications, etc. Until recently, we've been able to adapt BaseX to fit our needs by writing a thin wrapper layer that interfaces with the appropriate BaseX classes. However, with the rapid pace of BaseX development these days it's becoming increasingly difficult to track each release since we rely on aspects of the BaseX codebase that are not really intended for public consumption and thus keep changing. This brings up a couple questions:
- Are we the only ones interested in a direct embedded interface?
- Does the BaseX team have any plans to implement such an interface?
- Would such an interface be better implemented by the BaseX team (as opposed to a third party)?
I don't mind doing some work in this area, however, I have some concerns about doing so. Primarily, given that the whole idea would be to make direct integration easier and more stable it seems like the structure and layout of the classes in the embedded API and the ways that they interact with the underlying BaseX objects should probably be determined by the BaseX team. The danger is that someone outside the team spends effort creating such an interface only to do things in a way that's either not preferred or difficult to maintain as the core team continues to improve the overall product.
Hopefully this was clear... Thoughts?
Dave