BaseX XQuery vs. python / lxml performance


Ronny Möbius

28 Mar 2012

2:49 p.m.

Hi,

For a couple of years now we have been using XML files to organize the schedule of lectures in our institute. The amount of data for one semester is about one megabyte. To present the schedule on the web, we currently use Python in combination with the C-based lxml XPath implementation. Only basic node selection happens at the lxml layer; most of the logic is implemented by filtering Python lists and lists of dictionaries. We read the XML data from hard disk and parse it into lxml objects.

From time to time we have thought about storing the data in an XML database, as this seems intuitive. One main point is that we need some sort of transaction management, because different applications may manipulate the files simultaneously.

I have now given BaseX a try and implemented one basic view of our online schedule in XQuery. I soon noticed that all of the logic should live inside XQuery, since the query should return the fully prepared HTML for online presentation.

Unfortunately, we now have the impression that we did not gain any speed. On the contrary, the query alone needs more execution time than the whole corresponding finished interface (including the user interface etc.).

I'm now interested in your general opinion about this: Is it surprising that the XQuery implementation is slower than the lxml/Python one? (For me it is, as I thought the indices etc. created when importing the data should decrease the computational effort of searching the tree.) Is there some catch in my approach? Might the reason be a badly designed query?

I would appreciate hearing from you, Ronny


Charles Duffy

28 Mar

3:07 p.m.


Howdy, Ronny --

I can't speak for the BaseX team, but I can certainly say that the specific queries in use make a very, very big difference.

(Personally, by the way, I'm in the process of moving as much logic as I can into code running in BaseX simply because using XQuery 3.0 makes it hard to go back to the minimal XPath 1.0 implementation that libxml2, and thus lxml, supports).

Johannes.Lichtenberger

5:19 p.m.


I suppose the performance depends heavily on the query and on the query processor: its rewritings, cost-based analysis and index structures. So I think it would be best to post the queries in a reply, so that the BaseX team can suggest similar queries that make better use of the index structures and of the query processor's optimizations. Maybe you can also create an in-memory database that doesn't have to be persisted (and deactivate any indexes that are not needed).

kind regards, Johannes

Michael Seiferle

29 Mar

5 a.m.

Hi Ronny,

Hi Johannes & Charles, thanks for joining the conversation.

In my opinion, and speaking officially for BaseX, I'd suppose that XML processing with BaseX databases should almost always[1] be faster than processing the XML sequentially via lxml.

However, performance may vary depending on the actual queries and/or the Python glue code.

I think Charles' approach of having as much logic in XQuery as possible will be the best option here. Maybe some of your Python code could be rewritten in XQuery as well; on the other hand, this might not even be necessary thanks to query rewrites, as Johannes suggested.

@Ronny, maybe you could provide us with some sample code? In case it is not intended for the general public feel free to send it to support@basex.org.

Looking forward to seeing your code!

Best regards from Lake Constance

Michael

[1] I can sure think of examples that prove me wrong ;-)

Ronny Möbius

6:14 a.m.

Hi Johannes, Charles and Michael,

first of all, thanks for your immediate readiness to help.

I will briefly present the structure of the database:

<Dataset>
  <Structure>
    <Institute Name="Physik">
      <Degree Abbr="ABC" Name="ABC">
        <Module Abbr="HIJ" Name="HIJ">
        </Module>
      </Degree>
    </Institute>
  </Structure>
  <Lessons>
    <Lesson ID="12345">
      <Name lang="de">Name of a Lesson</Name>
      <AssociatedModules>
        <Module Abbr="HIJ"/>
        <Module Abbr="ABC"/>
      </AssociatedModules>
    </Lesson>
  </Lessons>
</Dataset>

The task is now to create a list like this one: http://vlvz1.physik.hu-berlin.de/ss2012/physik/verzeichnis/en/, that is, the whole structure, but only with those modules that actually have associated lessons.

The current query looks like this:

let $lang := data($ses/lang)
let $sem := data($ses/sem)
let $inst := data($ses/inst)
let $semxml := db:open("vlvz",concat($sem,'.xml'))
let $moduleswithlvs := distinct-values($semxml//AssociatedModules/Module/@Abbr)
return
<span>
  <div class="struc">
  {
    for $degree in $semxml//Institute[@Name=$ses//inst]/Degree[Modules//Module/@Abbr=$moduleswithlvs]
    return
    <div class="indent">
      <span class="degree">{data($degree/@Abbr)} {data($degree/@Name)}<br/></span>
      {
        for $module in $degree/Modules//Module[(* and */@Abbr=$moduleswithlvs) or @Abbr=$moduleswithlvs]
        let $leaf := not($module/*)
        let $depth := functx:depth-of-node($module)-7
        return
        <div class="indent depth{$depth}">
          {data($module/@Abbr)} {data($module/@Name)} <br/>
          {
            if ($leaf) then
              <div class="indent">
              {
                for $lesson in vlvz:getlvs($semxml,data($module/@Abbr))
                return
                <div class="lesson">
                  <span class="lessonid">{$lesson/@ID}</span>
                  <span class="lessonname">{$lesson/Name[@lang=$ses//lang]}</span>
                  <span class="lessonmodules">{string-join($lesson/AssociatedModules/Module/@Abbr,', ')}</span>
                </div>
              }
              </div>
            else ()
          }
        </div>
      }
    </div>
  }
  </div>
</span>

I already noticed that [1] is crucial: returning this node makes the query run about 10 times longer than returning an empty sequence. It makes no difference whether I return just <div></div> instead; that is as slow as returning the node with its content. I should also mention the function vlvz:getlvs:

declare function vlvz:getlvs($semxml as node()*,$modabbr as xs:string) as node()* {
  for $l in $semxml//Lesson
  where $l[AssociatedModules/Module/@Abbr=$modabbr]
  order by data($l/@ID)
  return $l
};

The queries are probably badly designed with respect to performance: basically everything I have done with XQuery so far has been learning by doing.
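For comparison, the Python side of such a task might look roughly like the sketch below. It is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, the sample data and the names index_lessons_by_module and modules_with_lessons are hypothetical, and it ignores nested modules, languages and HTML output. The idea it shows is building a dictionary from module abbreviation to lessons once, rather than scanning all lessons again for every module, which is what vlvz:getlvs appears to do when no index rewriting applies.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data following the structure shown in this message.
SAMPLE = """
<Dataset>
  <Structure>
    <Institute Name="Physik">
      <Degree Abbr="ABC" Name="ABC">
        <Module Abbr="HIJ" Name="HIJ"/>
        <Module Abbr="XYZ" Name="XYZ"/>
      </Degree>
    </Institute>
  </Structure>
  <Lessons>
    <Lesson ID="12345">
      <Name lang="de">Name of a Lesson</Name>
      <AssociatedModules>
        <Module Abbr="HIJ"/>
        <Module Abbr="ABC"/>
      </AssociatedModules>
    </Lesson>
  </Lessons>
</Dataset>
"""


def index_lessons_by_module(root):
    """Map each module abbreviation to its lessons in a single pass."""
    index = defaultdict(list)
    for lesson in root.findall("Lessons/Lesson"):
        for module in lesson.findall("AssociatedModules/Module"):
            index[module.get("Abbr")].append(lesson)
    # Mirror the 'order by data($l/@ID)' clause of vlvz:getlvs.
    for lessons in index.values():
        lessons.sort(key=lambda lesson: lesson.get("ID"))
    return index


def modules_with_lessons(root, institute):
    """Yield (module abbreviation, lesson IDs) for modules that have lessons."""
    index = index_lessons_by_module(root)
    for inst in root.findall("Structure/Institute"):
        if inst.get("Name") != institute:
            continue
        for module in inst.iter("Module"):
            lessons = index.get(module.get("Abbr"))
            if lessons:  # skip modules without associated lessons
                yield module.get("Abbr"), [l.get("ID") for l in lessons]


root = ET.fromstring(SAMPLE)
result = list(modules_with_lessons(root, "Physik"))
print(result)
```

With this layout each lesson is visited once while the index is built, and each module lookup is a dictionary access afterwards.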

Best regards from the capital, Ronny


BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

Ronny Möbius

6:18 a.m.

I'm sorry, the markup from copy and paste was a bit unexpected, so I am sending it again.



Ronny Möbius

30 Mar

8:06 a.m.

Hi again,

I found a big performance killer myself.

On 03/29/2012 12:18 PM, Ronny Möbius wrote:

declare function vlvz:getlvs($semxml as node()*,$modabbr as xs:string) as node()* {
  for $l in $semxml//Lesson
  where $l[AssociatedModules/Module/@Abbr=$modabbr]
  order by data($l/@ID)
  return $l
};

Replacing "$semxml//Lesson" with "$semxml/Dataset/Lessons/Lesson" makes a difference of about one third of the time spent.

Why is that? I thought I didn't have to care about the inefficiency of "//". Isn't that handled by indices?

All the best, Ronny

Christian Grün

8:31 a.m.

Dear Ronny,

I would dare claim that BaseX provides one of the most advanced query compilers/optimizers for XQuery which you can find out there. Still, due to the complexity of the language, there are many cases in which queries can be further sped up by manually rewriting them. This is just what you did (and as we do for commercial customers).

In the particular case, we don't know at compile time which nodes will be bound to the $semxml variable. It could even be that the bound nodes will refer to different databases, in which case we'd have to provide different path rewritings for the same original path.
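As a rough picture of why the explicit path can be cheaper when no rewriting or index applies, the following Python sketch counts how many elements each strategy touches. It does not show how BaseX evaluates paths internally; it only contrasts a full descendant scan with following named child steps. The dataset builder and all function names are hypothetical, and the standard library's xml.etree.ElementTree is used in place of lxml.

```python
import xml.etree.ElementTree as ET


def build_dataset(n_lessons, n_modules):
    """Build a hypothetical dataset shaped like the one in this thread."""
    root = ET.Element("Dataset")
    structure = ET.SubElement(root, "Structure")
    inst = ET.SubElement(structure, "Institute", Name="Physik")
    degree = ET.SubElement(inst, "Degree", Abbr="ABC", Name="ABC")
    for i in range(n_modules):
        ET.SubElement(degree, "Module", Abbr=f"M{i}", Name=f"M{i}")
    lessons = ET.SubElement(root, "Lessons")
    for i in range(n_lessons):
        lesson = ET.SubElement(lessons, "Lesson", ID=str(i))
        assoc = ET.SubElement(lesson, "AssociatedModules")
        ET.SubElement(assoc, "Module", Abbr=f"M{i % n_modules}")
    return root


def lessons_by_descendant_scan(root):
    """Like //Lesson: visit every element and keep those named Lesson."""
    visited = 0
    found = []
    for elem in root.iter():
        visited += 1
        if elem.tag == "Lesson":
            found.append(elem)
    return found, visited


def lessons_by_explicit_path(root):
    """Like /Dataset/Lessons/Lesson: only follow the named child steps."""
    visited = 0
    found = []
    for container in root:  # children of Dataset
        visited += 1
        if container.tag != "Lessons":
            continue
        for child in container:  # children of Lessons
            visited += 1
            if child.tag == "Lesson":
                found.append(child)
    return found, visited


root = build_dataset(n_lessons=100, n_modules=10)
scan_found, scan_visited = lessons_by_descendant_scan(root)
path_found, path_visited = lessons_by_explicit_path(root)
print(len(scan_found), scan_visited)
print(len(path_found), path_visited)
```

Both functions return the same lessons, but the descendant scan also walks through the Structure subtree and through the content of every lesson, while the explicit path never descends below the Lesson elements.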

Hope this helps, Christian
