Hi Dave,
just some quick feedback regarding your idea of streaming data directly into the database: while I generally find this a really elegant approach, I have some concerns when it comes to the insertion of invalid input. I'm not sure what effect XML streams will have if they get interrupted for some more or less unknown reason, or if they contain corrupt content. In that case, we'd probably have to roll back our update operation in order not to corrupt the database, which isn't possible in the existing architecture. If I remember right, this was also the main reason why we added a database cache in the first place.
What do you think about that issue? I hope my worries are not too pessimistic here; once we support MVCC, this may appear in a different light (but I can't give any perspective on the time frame yet).
Regarding performance: have you already done some general profiling (e.g. via -Xrunhprof) to find out what the major bottlenecks in the current architecture are? We've recently done some optimizations for inserting JSON tweets into BaseX, and we've managed to increase the number of inserted items from ~300 to an impressive 50,000 per second. The tests used the existing BaseX classes and methods (InsertInto, Data.insert), and most of the time was eventually saved by:
- setting the database option AUTOFLUSH to false and explicitly calling the FLUSH command whenever appropriate
- buffering some JSON inputs and inserting them in a bulk operation with a single XQuery
- some optimizations in the Data.insert() method itself (already committed to the repository)
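For illustration, the first two points look roughly like this when driven through the local Java command API (an untested sketch; the "mydb" database, the /tweets target path and the loop are made up, and constructor details may differ between versions):

  import org.basex.core.Context;
  import org.basex.core.cmd.*;

  public final class BulkInsertSketch {
    public static void main(final String[] args) throws Exception {
      final Context ctx = new Context();
      new Open("mydb").execute(ctx);               // target database (name assumed)
      new Set("AUTOFLUSH", false).execute(ctx);    // 1) disable automatic flushing

      // 2) buffer a batch of inputs and insert them with a single XQuery
      final StringBuilder batch = new StringBuilder();
      for(int i = 0; i < 1000; i++) {
        if(i > 0) batch.append(',');
        batch.append("<tweet id='").append(i).append("'/>");
      }
      new XQuery("insert node (" + batch + ") into /tweets").execute(ctx);

      new Flush().execute(ctx);                    // flush explicitly when appropriate
      ctx.close();
    }
  }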
Just my two cents; I must admit that I know too little about your architecture to allow myself any educated guesses.
Christian
On Thu, Nov 17, 2011 at 2:46 AM, Dave Glick dglick@dracorp.com wrote:
At the risk of beating a dead horse, I've continued to look at this. I keep coming back to the Data.insert() method (which is what we based our original streaming insertion code on). It seems to me that there is an inefficiency present in some cases. For example, the sequence for the Add command handler is (correct me if I'm wrong):
- Construct a Parser for the input content
- Create a Builder for a temporary database (memory or disk depending on content size, if it can be determined)
- Parse the content and build the temporary database
- Traverse the temporary database and construct a temporary insertion buffer
- Periodically flush the buffer to the target database
I have to wonder whether the temporary database is needed at all. The Parser or Builder should be able to stream updates to the target database without having to construct a temporary one. Indeed, that's exactly what our own current code does - it uses the logic in the Data.insert() method (or at least an older version of it), but instead of manipulating the update buffer based on traversal of a temporary database, it does so based on events from the streaming reader. The sequence would then be:
- Construct a Parser for the input content
- Parse the content and construct a temporary insertion buffer
- Periodically flush the buffer to the target database
For normal BaseX operation this might only improve the performance of document insertions using the Add command. There are probably lots of reasons why the XQuery update code should continue to construct a temporary database. However, even improving that initial document load performance may be worth the effort. In addition, being able to insert to the database directly from a stream may be useful in the future, for example as part of an embedded API.
The best way to generalize this behavior is probably by creating a new Builder subclass that manipulates a pre-existing Data instance rather than a new one - let's call it InsertBuilder. This way you could use any Parser stream to send content to the database for insertion. One complication would be that you essentially have duplicate code at that point - the Data.insert() method and the InsertBuilder would have very similar logic and the OO purist in me doesn't like this idea. A way to reduce this duplication would be to use the InsertBuilder even when the source is another Data instance. Since a Builder gets "triggered" by a Parser, this would require a Parser subclass that fires events by traversing a Data instance, let's call it DataParser. I do notice that there is a DBParser internal class inside OptimizeAll - does this do something similar?
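To make that a bit more concrete, here's the rough shape I have in mind. To be clear, everything below uses made-up stand-in types and method names (not the real org.basex.build Builder/Parser or Data classes), just to sketch the event flow and the periodic flush:

  import java.util.ArrayList;
  import java.util.List;

  // Stand-in for the events a Builder receives from a Parser (hypothetical shape).
  interface EventBuilder {
    void openElem(String name);
    void attribute(String name, String value);
    void text(String value);
    void closeElem();
  }

  // Stand-in for a thin view of the target database (hypothetical shape).
  interface TargetDatabase {
    // Apply a batch of buffered events at the given insertion point.
    void applyBufferedEvents(int pre, List<String[]> events);
  }

  // Builder variant that appends into a pre-existing database instead of building a new one.
  // (Position bookkeeping across flushes is omitted here.)
  final class InsertBuilder implements EventBuilder {
    private static final int FLUSH_LIMIT = 10000;       // flush the buffer every N events
    private final TargetDatabase target;
    private final int insertPre;                         // insertion point in the target database
    private final List<String[]> buffer = new ArrayList<String[]>();

    InsertBuilder(final TargetDatabase target, final int insertPre) {
      this.target = target;
      this.insertPre = insertPre;
    }

    public void openElem(final String name) { add("open", name); }
    public void attribute(final String name, final String value) { add("attr", name + '=' + value); }
    public void text(final String value) { add("text", value); }
    public void closeElem() { add("close", ""); }

    // Call once the parser has consumed the whole input.
    public void finish() { flush(); }

    private void add(final String kind, final String value) {
      buffer.add(new String[] { kind, value });
      if(buffer.size() >= FLUSH_LIMIT) flush();
    }

    private void flush() {
      if(buffer.isEmpty()) return;
      target.applyBufferedEvents(insertPre, new ArrayList<String[]>(buffer));
      buffer.clear();
    }
  }

  // A DataParser would then be the mirror image: it walks an existing Data instance
  // (pre 0 .. size-1) and fires the same four events, so database-to-database copies
  // go through exactly the same InsertBuilder code path as streamed input.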
Using our existing codebase as well as the current Data.insert() method as guidance, I don't think it would be too challenging to create the InsertBuilder and DataParser classes, and I don't mind taking on the task. The question is, does this seem valuable to anyone but me? Would such a change be a welcome addition to BaseX, or would it simply make things too complicated? Are there reasons I haven't thought of why this isn't a great idea or wouldn't result in performance boosts (quite likely)?
Well, I've polluted this message board enough for a couple days. If I'm starting to be a nuisance, please by all means tell me to cool off for a while - it won't hurt my feelings :)
Dave
-----Original Message----- From: Dave Glick Sent: Wednesday, November 16, 2011 5:54 PM To: 'BaseX-Talk@mailman.uni-konstanz.de' Subject: Streaming Updates (Embedded API)
So I've come across another important use case for an embedded API: direct streaming updates. What I mean by this is the ability to perform update operations (such as insertion) on the database using a stream that contains input content.
As you know, I'm in the process of updating our own embedded API from an old 5.x version of BaseX to version 7 (which is what started this whole embedded API train of thought). We had code in place to accomplish this in our old version by reading from a stream (a .NET XmlReader, which is SAX-like), applying direct Data update calls (such as Data.elem() or Data.text()) for each event, and then flushing the Data object when the stream ran out. This has two problems: first, the code is very complex (e.g., it has to keep track of the current insertion point, resolve namespaces, etc.) and is difficult to maintain; second, it probably isn't conformant since it doesn't do things like check for adjacent text nodes at the insertion point. I would love to replace this code with something much simpler and let BaseX handle the details so I can be sure it works right.
Which brings me to some more immediate questions than the more theoretical API discussion. I'm looking for ways to use existing BaseX functionality to manage database updates, particularly for very large content streams. I considered just creating appropriate queries and parsing/evaluating them, but some of the input content could be very large (as in, might overrun a single string buffer) and it doesn't seem totally efficient to read from a tokenizing SAX-like stream, convert to a string, and then run it through a second tokenizing parse inside the query processor. So then I started looking at the internals of the query processor, particularly as it relates to XQuery updates. I found the DatabaseUpdates class and in particular the DatabaseUpdates.apply() method, which looks like it provides some direct control over the process. I could manually create update primitives, add them to a DatabaseUpdates instance, and then apply that to the database. I think that will accomplish what I'm looking for.
The main area where I'm hung up right now is how to specify the input stream/content for the update operation. My current thinking is to use the stream to create an in-memory temporary database that holds the content. This shouldn't be tough since I already have a .NET XmlReader to BaseX Parser bridge in place and I can use that to create a MemData instance pretty easily. However, I could use some help getting from the MemData to a NodeCache that can be used in the constructor for update primitives like InsertBefore.
So, another long post made short, here are my questions:
- How can a NodeCache be constructed from a MemData?
- It appears as though InputInfo is mainly used for error reporting - does it have other uses that I need to watch out for (I'm thinking I'll just initialize it to default values)?
- Does this whole strategy sound reasonable or is there a better way to conduct update operations on the database given an input stream?
BTW - I know it's been a long time coming, but we have every intention of releasing our .NET API once we get it updated and documented/packaged (which has been our main hang-up in the past - never enough time for polish). I'm not sure what kind of licensing model we're going to use. I'm personally pushing for open source (preferably BSD like BaseX), but we might end up with something like a GPL with an available commercial option or non-open source freemium model (unfortunately, got to please the accountants). If any of the community members out there are interested in the project and have some input (what you might be willing to pay, what kind of .NET-centric additional functionality would be valuable, etc.) I'd love to hear it.
Thanks,
Dave
-----Original Message----- From: Dave Glick Sent: Tuesday, November 15, 2011 11:28 AM To: 'Christian Grün' Cc: BaseX-Talk@mailman.uni-konstanz.de; Eric Murphy; Charles Foster Subject: RE: [basex-talk] Embedded API
Christian,
Thank you, as always, for your thoughtful and thorough response. I'm excited to hear that this is already being discussed around BaseX world headquarters. Several of your possible approaches raise interesting questions or capabilities that I hadn't considered, especially given my somewhat myopic view of the situation based on our own requirements. As you guessed, my own interest is in directly exposing the underlying data structures in a more stable and cohesive way. I've looked in-depth at the existing APIs (DOM, XML:DB, and XQJ) and while the development idealist in me wants to use and champion standard protocols, the pragmatist that needs to get a job done accepts that such standards tend to support the lowest common denominator and thus fail to capture more complex or specific use cases (though I still applaud you guys for providing and supporting such a broad range of APIs, which are certainly valuable in many situations). While the existing APIs do provide a lot of functionality, they're not consistent (use DOM for traversing nodes, XQJ for querying, etc.) and there are gaps in functionality (creation and manipulation of databases, direct updating, etc.). In any case, the existence of these other APIs has provided a lot of sample code that we've adapted in our own embedded API.
@ Eric - you mentioned that you're working on an Android port and have taken the viewpoint that BaseX is a framework on which other tools can be built. That's very much in line with how we've been treating it. For a little background, we've been using BaseX to support a variety of projects, the largest of which is a generic RCP for the manipulation of modeling and simulation data (think Eclipse for M&S). We take inputs to models and simulations, parse them up into XML, and then store, query, and manipulate them in BaseX with lots of frontend visualizations and other graphical interfaces. The challenging part of our work is that we use .NET and Mono, so we cross-compile BaseX using IKVM and then wrap it to make it easier to work with in .NET. Since all of our work uses BaseX as an embedded database, we've found that we're basically developing something akin to an embedded API in our wrapper. Rather than create such an API inside the .NET wrapper, I'd much prefer the API existed in BaseX itself and the wrapper could be "thinner". The biggest problem we have is keeping up with changes to the classes in BaseX. It's being released and improved on such a rapid cycle (great job!) that we're constantly reworking portions of our API to match, and frequently get multiple releases behind.
Perhaps it would help if I shared some of our own use cases. Obviously we're not the only potential users of an embedded API, but some of the issues we've had using it in that mode may help refine requirements for an eventual implementation.
- Node referencing and persistence
We interact with the database a lot. The general pattern is that a given visualization or view in our tool will execute a query to return a result set and will then store references to the resultant nodes. The application provides a multiple document interface and each view on a given document can potentially manipulate the database. This is why we store references to each node. Some of the queries are rather time consuming and rather than re-evaluate them each time the data is changed in other views, we simply persist the node references through changes.
This was one of the harder areas to get right in our API. Because BaseX is primarily client/server driven and focused on returning serialized results, the notion of node references and persistence just isn't there. Internally, BaseX stores a pre value inside its node objects. This is fine until the database changes - something gets inserted or deleted and the pre values change. At that point, the previous pre value no longer refers to the same node and any objects that rely on it are invalidated. This works great for all the current use cases because there's no need for node references to persist across database updates. However, in an embedded environment, there are lots of reasons why node references should be persisted. To solve this we used the following algorithm:
- We initialize node objects with a Data and pre value, just like the internal DBNode class.
- We immediately fetch the associated (and immutable) index for the pre on construction.
- We also store the timestamp for the most recent database modification.
- Every time an operation is performed on a node object, it first checks whether the saved timestamp differs from the database's; if so, it updates the pre value using the stored index value and saves the new timestamp.
- If the index no longer exists (the node was removed) the object is invalidated.
This ensures that a reference to a node object continues to reference the "same" node no matter what happens to the database.
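In code, the pattern looks roughly like this (a condensed sketch; the Data accessors are hidden behind a made-up interface here rather than the real BaseX classes, and the method names are ours):

  // Minimal stand-in for the parts of the database we rely on (names are ours, not BaseX's).
  interface DataView {
    int idOf(int pre);        // immutable node id for a pre value
    int preOf(int id);        // current pre value for an id, or -1 if the node was removed
    long lastModified();      // timestamp of the most recent database modification
  }

  // A node reference that survives database updates.
  final class PersistentNodeRef {
    private final DataView data;
    private final int id;          // fetched once on construction, never changes
    private int pre;               // cheap to use, but invalidated by updates
    private long stamp;            // database timestamp the pre value belongs to
    private boolean valid = true;

    PersistentNodeRef(final DataView data, final int pre) {
      this.data = data;
      this.pre = pre;
      this.id = data.idOf(pre);
      this.stamp = data.lastModified();
    }

    // Call before every operation on the node.
    int currentPre() {
      final long now = data.lastModified();
      if(now != stamp) {             // database changed since we last looked
        pre = data.preOf(id);        // re-resolve the pre value via the immutable id
        stamp = now;
        valid = pre != -1;           // node was deleted in the meantime
      }
      if(!valid) throw new IllegalStateException("node no longer exists");
      return pre;
    }
  }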
- Node traversal
We traverse nodes in our application. A lot. There are many cases where an operation needs to be performed dynamically and it's much easier to manipulate the database directly than to attempt to construct an XQuery that captures the desired behavior and execute that. Not to mention that we've realized significant performance gains by bypassing the query engine and relying on direct manipulation for tasks that perform large numbers of traversal and modification operations (sometimes on the order of 10,000 or 100,000 nodes). In fact, this is already captured in the current DOM API implementation. I would just suggest that any future embedded API also make sure to provide direct and efficient node traversal operations.
- Direct updating
If you can store a reference to nodes, and you can traverse them to get to other nodes, it stands to reason that you should be able to change them as well. This is another area where we've spent significant effort creating some capability that didn't already exist. In our own node class we provide operations for setting inner text (or atomic value if you prefer), inner XML, and outer XML (though that last one is kind of a hack since we just remove the node in question and insert new content as a child of the parent at the appropriate location). Again, this is an area we noticed performance gains over using XQuery update when appropriate. That and we implemented a lot of the logic before XQuery update was even supported in BaseX (though we've had to change it significantly to keep up). This seems like another area where a complete embedded API should provide some capability.
- Query manipulation
I think XQJ is on the right track here. Ideally, the notion of a query should be captured in an object and exposed so that the context can be manipulated, collections can be created, variables can be created/initialized, etc. I'd just like to see the eventual embedded API include this functionality directly integrated with the other parts of the embedded API.
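For reference, the internal QueryProcessor already covers a fair amount of this - something along these lines (untested and written from memory; the "mydb" database and query are made up, and import paths/method signatures have shifted between versions):

  import org.basex.core.Context;
  import org.basex.core.cmd.Open;
  import org.basex.query.QueryProcessor;
  import org.basex.query.item.Item;
  import org.basex.query.iter.Iter;

  public final class QueryObjectSketch {
    public static void main(final String[] args) throws Exception {
      final Context ctx = new Context();
      new Open("mydb").execute(ctx);             // open a database to query (name assumed)
      final QueryProcessor qp = new QueryProcessor(
        "declare variable $min external; //item[@price > $min]", ctx);
      qp.bind("min", 100);                       // initialize an external variable
      final Iter iter = qp.iter();               // evaluate lazily
      for(Item item; (item = iter.next()) != null;) {
        System.out.println(item);                // real code would serialize the item properly
      }
      qp.close();
      ctx.close();
    }
  }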
I would be happy to download and start looking at QT3TS - I respect Michael's work on Saxon and am interested to see what his test suite looks like. I'd also be happy to run some benchmarking and performance comparisons on direct manipulation vs. XQuery evaluation for various operations in order to get some hard numbers. Anecdotally, my general experience has been that directly manipulating the database through primitives is much more efficient for large numbers of operations or situations where the logic is very dynamic or complex and an XQuery statement would be long or difficult to derive.
This email ended up being a bit longer than I had intended - but that always seems to happen with me :). I hope some of my stream-of-consciousness was valuable. Let's keep up this discussion... I'll write more as I think of it.
Dave
-----Original Message----- From: Christian Grün [mailto:christian.gruen@gmail.com] Sent: Monday, November 14, 2011 7:56 PM To: Dave Glick Cc: BaseX-Talk@mailman.uni-konstanz.de; Eric Murphy; Charles Foster Subject: Re: [basex-talk] Embedded API
Hi Dave, hi all,
better Java APIs for BaseX - yes, that's a very relevant topic nowadays, something we've frequently been discussing in our team over the last few weeks. The main challenge we're struggling with is that there are just too many ways such an API could look - and too many incoming requests that can hardly be bundled into one single API. Here are some of the requirements we're dealing with, and the approaches that could be pursued (..and I already know which of them you would prefer ;) :
- A new Command and Query/Result API could enhance/replace the existing light-weight client Java API, and the representation of results would be separated from the low-level data structures in BaseX. This API could be used in the client/server architecture as well, but it would introduce some overhead, as all the data structures would have to be replicated by the client. The new Command, Query and Result objects could also be made serializable. This way, they could be easily transferred over the network, and there would be no need to develop custom binary protocols.
- A real embedded API could ensure that developers do not suffer from frequent changes in our query and storage backend. Instead, we would ensure that the API does not change as long as the major version is not updated. This API would be much more efficient than a client/server API, but we might have to put more work into transactional issues.
- The existing XML:DB and XQJ APIs could be revised and updated to support the client/server architecture. This could reduce the need for any other client/server-based API with richer functionality.
Everyone who is interested in more powerful APIs.. Please speak out! The more feedback we get, the better we'll be able to design our APIs. And of course we're interested in volunteers out there... Last but not least, this is an Open Source and community project ;)
@Dave: I've recently added a minimal query API for the QT3TS, Michael Kay's new W3C XQuery Test Suite. Both the test suite driver and the mini API (qt3api) are still work in progress:
https://github.com/BaseXdb/basex-tests/tree/master/src/main/java/org/basex/t...
It is not low-level enough to directly support any axis or update operations; instead, new QueryProcessor instances are created to perform queries on intermediate nodes. It would be great if you could have a look at this API, and it would then be interesting to know more about your performance requirements: do you think that the overhead for parsing and compiling query expressions (which usually takes no longer than a few microseconds, and is often faster than the actual axis traversals) will be too expensive in your scenario?
If you believe that this framework would be sufficient, we could start to enhance it, make it safe for concurrent access, document it, etc. If you need to work with the PRE and ID values of database nodes, for example, you could take advantage of the db: functions of BaseX [1]:
Output: db:node-id($node)
Input: db:open-id($db, $id)
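A quick (untested) example from Java, with a made-up database name and path:

  import org.basex.core.Context;
  import org.basex.core.cmd.XQuery;

  public final class NodeIdSketch {
    public static void main(final String[] args) throws Exception {
      final Context ctx = new Context();
      // Remember a node via its id, which - unlike the pre value - is not shifted by updates.
      final String id = new XQuery(
        "db:node-id(db:open('mydb')//item[@name = 'x'])").execute(ctx);
      // ...later, resolve the id back to the node, even after other updates have run.
      final String node = new XQuery("db:open-id('mydb', " + id + ")").execute(ctx);
      System.out.println(node);
      ctx.close();
    }
  }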
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Database_Functions
On Mon, Nov 14, 2011 at 6:26 PM, Dave Glick dglick@dracorp.com wrote:
Hi all,
We've been using BaseX for several years now and have constantly been skirting around our primary use case: using BaseX in an embedded mode. What I mean by this is using BaseX in-process in an application without running any kind of client/server communication bridge and with very direct access to BaseX primitives. There are several reasons for wanting to do this, including performance (which seems to be the subject of recent discussions, e.g., running the server in "local" mode). My own primary reason is to gain more direct access to the database objects. For example, we routinely have a need to:
- Directly access and traverse database nodes by climbing, descending, following, etc.
- Insert or remove content at a specific database node
- Store references to individual nodes (e.g., using their "pre" and "index" values)
- Fine-tune queries in order to set context, external functions, etc.
While many of these operations can indeed be performed through the existing client/server interface, it's less friendly - especially when doing things like asking for the next sibling of a given node. With a direct embedded API you just get the next node, bypassing the XQuery processor altogether. From my current work in this area, I think BaseX is already "primed" for this kind of API - 90% or more of the code is already in place since most of the primitives already expose common methods for use by database commands, XQuery processor, etc. All that should be needed is to expose this functionality in a stable and complete API.
Good examples of applications that may need this kind of API include media players (e.g., for storage of the media library data), simple stand-alone database applications, etc. Until recently, we've been able to adapt BaseX to fit our needs by writing a thin wrapper layer that interfaces with the appropriate BaseX classes. However, with the rapid pace of BaseX development these days it's becoming increasingly difficult to track each release, since we rely on aspects of the BaseX codebase that are not really intended for public consumption and thus keep changing. This brings up a couple of questions:
- Are we the only ones interested in a direct embedded interface?
- Does the BaseX team have any plans to implement such an interface?
- Would such an interface be better implemented by the BaseX team (as opposed to a third party)?
I don't mind doing some work in this area; however, I have some concerns about doing so. Primarily, given that the whole idea would be to make direct integration easier and more stable, it seems like the structure and layout of the classes in the embedded API, and the ways that they interact with the underlying BaseX objects, should probably be determined by the BaseX team. The danger is that someone outside the team spends effort creating such an interface only to do things in a way that's either not preferred or difficult to maintain as the core team continues to improve the overall product.
Hopefully this was clear... Thoughts?
Dave