Navigating a DOM

Rainer Klute

12 Oct 2012

3:58 p.m.

Hi,

I did some first steps with BaseX, but unfortunately I was not very successful.

I have a really large XML file which does not fit into memory, and I would like to navigate it as a DOM. My hope was that I could store it as a BaseX database, retrieve the root element as a org.w3c.dom.Node, and then start navigating down and up the DOM as needed without having to have the whole stuff in memory.

I tried something like this:

QueryProcessor processor = new QueryProcessor("doc('catalog')/*", context);
Iter iter = processor.iter();
Item item = iter.next();
Object node = item.toJava();

According to the debugger the item variable indeed denoted my root element. However, when calling item.toJava() not only that very node was returned. Instead BaseX obviously tried to retrieve the whole DOM – which was just the very thing I wanted to avoid.

Am I doing something wrong here? Or is this an unforeseen use case?

By the way, I also had trouble getting any result at all. Only with "doc('catalog')/*" did I get an item that was not null. When I tried to retrieve all elements named "article" using "doc('catalog')//article", the item was null. I also tried item.iter() in order to find a node's children. However, it turned out that item.iter().next() == item.

And I had quite a tough time fiddling around with the documentation and with the JavaDoc. While the documentation puts a lot of effort into XQuery, it remains unclear to some extent how to do some basic stuff with BaseX programmatically. This is a hurdle for the BaseX beginner. Some more Java examples and explanations would be nice, showing how to connect to the database, submit a query and process the result. Currently the examples show only how to dump a query result to System.out. It would be interesting to learn about processing it further – see my troubles with Iter.

-- Best regards Rainer Klute


Dirk Kirsten

13 Oct 2012

5:38 a.m.

Hello Rainer,

Welcome to BaseX and I hope we can help you with your troubles :)

If you are running an XPath like "doc('catalog')//article" and you get a null result, I suspect that your XML document uses some namespaces. If article is not in the default namespace, you will correctly retrieve a null value as result. Using the wildcard namespace should resolve your issue, i.e. using "doc('catalog')//*:article". If this does not resolve your issue, it would be nice to have an SSCCE [1] and your XML document. Without the data it is quite difficult to guess the root of the problem.
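The namespace effect can be reproduced without BaseX. The following is a minimal sketch using only the JDK's DOM and XPath classes; the sample document, the namespace URI urn:example:catalog and the class name NamespaceMatchDemo are made up for illustration. The JDK evaluates XPath 1.0, which has no *:article wildcard, so the sketch uses local-name() as the closest equivalent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceMatchDemo {
    // Hypothetical sample: every element inherits the default namespace "urn:example:catalog".
    static final String XML =
        "<catalog xmlns='urn:example:catalog'>"
      + "<group><article id='a1'/><article id='a2'/></group>"
      + "</catalog>";

    /** Evaluates an XPath 1.0 expression against the sample and returns the number of matching nodes. */
    static int count(String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // build a namespace-aware DOM so name tests compare namespaces
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // An unprefixed name test only matches elements in no namespace.
        System.out.println(count("//article"));                     // prints 0
        // Matching on the local name ignores the namespace.
        System.out.println(count("//*[local-name() = 'article']")); // prints 2
    }
}
```

The first query finds nothing because both article elements live in a namespace, which mirrors the empty result reported for "doc('catalog')//article".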

I personally never used the DOM for processing as XQuery itself is very nice. Indeed, using iteration and .toJava() should return just this single node. Maybe someone else on the list could help here. I agree that the help could be extended. However, there is some extensive example source code for Java available at https://github.com/BaseXdb/basex-examples/tree/master/src/main/java/org/base.... Maybe you could have a look at this source code. Hope this helps.

Cheers, Dirk

[1] http://sscce.org/


--
Dirk Kirsten, BaseX GmbH, http://basex.org
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22

Rainer Klute

15 Oct 2012

6:35 a.m.

On 13.10.2012 11:38, Dirk Kirsten wrote:

> Welcome to BaseX and I hope we can help you with your troubles :)

Yeah, thank you!

> If you are running an XPath like "doc('catalog')//article" and you get a null result, I suspect that your XML document uses some namespaces. If article is not in the default namespace, you will correctly retrieve a null value as result. Using the wildcard namespace should resolve your issue, i.e. using "doc('catalog')//*:article".

Yes, thanks, paying attention to namespaces was indeed helpful. My application now does what it should do – well, at least sort of. It retrieves the "article" elements from the database, the query returns quickly and I can access the articles one by one.

However, articles are quite shallow. The whole model comprises about 16,000 articles nested in a hierarchy of container elements. The root of everything is the "catalog" element. And when I retrieve that one, the VM still fails with an OutOfMemoryError. The stack trace shows that the org.basex.query.util.DataBuilder class is busily and recursively adding nodes, elements and attributes and obviously tries to build up the whole tree in memory.

> I personally never used the DOM for processing as XQuery itself is very nice. Indeed, using iteration and .toJava() should return just this single node. Maybe someone else on the list could help here.

Speaking as an XQuery newbie, I can imagine that it is possible to implement interactively controlled navigation up and down a DOM in XQuery. However, I am not yet convinced that this is the most straightforward way to do that, and I'd prefer using the DOM directly.

> I agree that the help could be extended. However, there is some extensive example source code for Java available at https://github.com/BaseXdb/basex-examples/tree/master/src/main/java/org/base.... Maybe you could have a look at this source code. Hope this helps.

Yes, I found these examples. They showed me that there is a QueryProcessor, how to use it to submit a query, get an Iter as result and use toJava() to get a DOM node. On the other hand, examples and API documentation don't help much if you cannot go the DOM way (for the reasons mentioned above) and want to try something else, namely fiddle around with org.basex.query.value.item.Item and its subclasses.

-- Best regards Rainer Klute

Maximilian Gärber

6:56 a.m.

Hi,

when you do not want the children of an item, try the except() function.

Something like:

let $items := doc('catalog')//*:article
for $i in $items
return $i except ($i/*:article)

would give you the shallow results

Regards,

Max


--
Maximilian Gärber
axxepta solutions GmbH
Postfach 51 02 38
13362 Berlin
Tel +49 (0)30 499 147 66
Fax +49 (0)30 499 147 67
Mail gaerber@axxepta.de

Rainer Klute

7:29 a.m.

On 15.10.2012 12:56, Maximilian Gärber wrote:

> when you do not want the children of an item, try the except() function.
>
> Something like:
>
> let $items := doc('catalog')//*:article
> for $i in $items
> return $i except ($i/*:article)
>
> would give you the shallow results

True in principle, but I won't be able to dive into the articles later.

-- Best regards Rainer Klute

Christian Grün

23 Oct 2012

2:59 p.m.

Dear Rainer,

> I have a really large XML file which does not fit into memory, and I would like to navigate it as a DOM. My hope was that I could store it as a BaseX database, retrieve the root element as a org.w3c.dom.Node, and then start navigating down and up the DOM as needed without having to have the whole stuff in memory.

By accident, a previous version of BaseX did exactly what you describe. In more recent versions, the DOM node is completely materialized in memory, because lazy processing caused too many unwanted side effects regarding concurrency and node caching. While the resulting representation takes less space than the original Java DOM representation, and is faster in many cases, it still takes about 2-3 times the size of the textual representation.
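The difference between lazy and complete materialization can be sketched in plain Java. This is not BaseX code: LazyNode, build and materialize are hypothetical names, and the tree is generated in memory purely for illustration. The counter records how many nodes have been created after each step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class LazyVsEager {
    /** Number of node objects constructed so far. */
    static int loaded = 0;

    /** Hypothetical node whose children are produced only on first access. */
    static final class LazyNode {
        final String name;
        private final Supplier<List<LazyNode>> loader;
        private List<LazyNode> children; // stays null until children() is called

        LazyNode(String name, Supplier<List<LazyNode>> loader) {
            this.name = name;
            this.loader = loader;
            loaded++;
        }

        List<LazyNode> children() {
            if (children == null) children = loader.get();
            return children;
        }
    }

    /** Creates a node that will expand to `fanout` children per level for `depth` levels. */
    static LazyNode build(String name, int depth, int fanout) {
        return new LazyNode(name, () -> {
            List<LazyNode> kids = new ArrayList<>();
            if (depth > 0) {
                for (int i = 0; i < fanout; i++) {
                    kids.add(build(name + "/" + i, depth - 1, fanout));
                }
            }
            return kids;
        });
    }

    /** Eager materialization: visits every descendant and returns the subtree size. */
    static int materialize(LazyNode node) {
        int count = 1;
        for (LazyNode child : node.children()) count += materialize(child);
        return count;
    }

    public static void main(String[] args) {
        LazyNode root = build("catalog", 3, 10);
        System.out.println("after fetching root: " + loaded);  // 1
        root.children();
        System.out.println("after one level: " + loaded);      // 11
        System.out.println("full tree: " + materialize(root)); // 1111
    }
}
```

With a fan-out of 10 and three levels, fetching the root creates one node, asking for its children creates ten more, and a full walk touches all 1,111 nodes. That last step corresponds to the cost a complete DOM conversion pays up front.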

What you can do, however, and what we regularly do, is using our internal node representation. A small example is shown in the following:

Context context = new Context();
QueryProcessor processor = new QueryProcessor("doc('catalog')/*", context);
context.register(processor);
Iter iter = processor.iter();
Item item = iter.next();
if(item instanceof ANode) {
  ANode node = (ANode) item;
  System.out.println("Name: " + node.qname());
  for(final ANode child : node.children()) {
    System.out.println("- Child: " + child);
  }
}
processor.close();
context.unregister(processor);
context.close();

Please remember to close the processor after having requested all nodes; otherwise, the database will be kept open. Using context.register(), you can be sure that no other write operation will modify your data as long as you're requesting it. If concurrency is no issue, feel free to remove the (un)register calls.

> And I had quite a tough time fiddling around with the documentation and with the JavaDoc. While the documentation puts a lot of effort into XQuery, it remains unclear to some extent how to do some basic stuff with BaseX programmatically. This is a hurdle for the BaseX beginner.

Absolutely true; our documentation is rather sparse when it comes to our internal low-level API, and we are well aware that many of our users would benefit from some more brain food regarding our architecture. As a matter of fact, writing good documentation takes a lot of resources, which is why we are always thankful for external contributions.

Still, we are doing our best to document our source code as well as possible. It may help a lot when you want to leave our high-level APIs, such as the client APIs and XQJ.

Christian

Rainer Klute

25 Oct 2012

10:09 a.m.

Hi Christian,

I just did a quick test using your suggestion, and what can I say? It seems to work very well! I guess the internal node representation is indeed what I need. Next month I'll get back to it and probably bother you with questions about that internal stuff. :-)

Thanks a million!

-- Best regards Rainer Klute

Rainer Klute

29 Oct 2012

12:14 p.m.

On 25.10.2012 16:09, Rainer Klute wrote:

> I just did a quick test using your suggestion, and what can I say? It seems to work very well! I guess the internal node representation is indeed what I need. Next month I'll get back to it and probably bother you with questions about that internal stuff. :-)

Christian,

do you have some info, documentation or sample code on how to access a DBNode if it represents an XML element, an attribute or some other type of XML node? Presently I am trying to figure out how to retrieve an attribute's value. Thanks!

-- Best regards Rainer Klute

Christian Grün

6:47 p.m.

Hi Rainer,

I’m sorry there is no public documentation on our low-level API. Did you have a look at the JavaDoc of our source code?

This is e.g. how you may request the value of an attribute:

ANode node = ...
byte[] value = node.attribute(new QNm("name"));
System.out.println(Token.string(value));

This is how you can iterate through all attributes:

for(final ANode att : node.attributes()) { ... }

Hope this helps,
Christian
