Hi,
for a couple of years now we have been using XML files to organize the lecture schedule of our institute. The data for one semester amounts to about one megabyte. To present the schedule on the web we currently use Python in combination with the C-based lxml XPath implementation. Only basic node selection happens at the lxml layer; most of the intelligence is implemented by filtering Python lists and lists of dictionaries. We read the XML data from hard disk and parse it into lxml objects.
From time to time we have thought about storing the data in an XML database, as this seems intuitive. One main point is that we need some sort of transaction management, because different applications may manipulate the files simultaneously.
Now I gave BaseX a try and implemented one basic output of our online schedule in XQuery. I soon noticed that all of the intelligence should be handled inside XQuery, as it should return the fully prepared HTML for presenting online.
Unfortunately we now have the impression that we did not gain speed. On the contrary, the query itself needs more execution time than the whole corresponding finished interface (including the user interface etc.).
I'm now interested in your general opinion about this: Is it surprising that the XQuery implementation is slower than the lxml/Python one? (For me it is, as I thought the indices etc. created when importing the data should decrease the computational effort of searching the tree.) Is there some catch in my approach? Could the reason be a badly designed query?
I would appreciate hearing from you,
Ronny
On 03/28/2012 01:49 PM, Ronny Möbius wrote:
> I'm now interested in your general opinion about this: Is it surprising that the XQuery implementation is slower than the lxml/Python one? (For me it is, as I thought the indices etc. created when importing the data should decrease the computational effort of searching the tree.) Is there some catch in my approach? Could the reason be a badly designed query?
Howdy, Ronny --
I can't speak for the BaseX team, but I can certainly say that the specific queries in use make a very, very big difference.
(Personally, by the way, I'm in the process of moving as much logic as I can into code running in BaseX simply because using XQuery 3.0 makes it hard to go back to the minimal XPath 1.0 implementation that libxml2, and thus lxml, supports).
On 03/28/2012 08:49 PM, Ronny Möbius wrote:
> I'm now interested in your general opinion about this: Is it surprising that the XQuery implementation is slower than the lxml/Python one? (For me it is, as I thought the indices etc. created when importing the data should decrease the computational effort of searching the tree.) Is there some catch in my approach? Could the reason be a badly designed query?
I suppose it depends heavily on the query and the query processor: rewritings, cost-based analysis and index structures. So I think it would be best to post the queries in a reply, so that the BaseX team can suggest similar queries that make better use of the index structures and the query processor's optimizations. Maybe you can also create an in-memory database that does not have to be persisted (and deactivate indexes that are not needed).
Kind regards, Johannes
Hi Ronny,
Hi Johannes & Charles, thanks for joining the conversation.
In my opinion, and speaking officially for BaseX, I'd suppose that XML processing with BaseX databases should almost always[1] be faster than processing the XML sequentially via lxml.
However, performance may vary depending on the actual queries and/or the python glue code.
I think Charles' approach of having as much logic in XQuery as possible will be the best option to pick here. Maybe some of your Python code could be rewritten in XQuery as well. On the other hand, this might not even be necessary thanks to the XQuery rewritings Johannes mentioned.
@Ronny, maybe you could provide us with some sample code? In case it is not intended for the general public feel free to send it to support@basex.org.
Looking forward to seeing your code!
Best regards from Lake Constance
Michael
[1] I can sure think of examples that prove me wrong ;-)
On 28.03.2012 at 23:19, Johannes.Lichtenberger wrote:
> So I think it would be best to post the queries in a reply, so that the BaseX team can suggest similar queries that make better use of the index structures and the query processor's optimizations.
Hi Johannes, Charles and Michael,
first of all, thanks for your immediate readiness to help.
I will briefly present the structure of the database:
<Dataset>
  <Structure>
    <Institute Name="Physik">
      <Degree Abbr="ABC" Name="ABC">
        <Module Abbr="HIJ" Name="HIJ">
          <!-- the Module nodes are arbitrarily nested in themselves -->
        </Module>
        <!-- more Module nodes -->
      </Degree>
      <!-- more Degree nodes -->
    </Institute>
    <!-- more Institute nodes -->
  </Structure>
  <!-- other information -->
  <Lessons>
    <Lesson ID="12345">
      <Name lang="de">Name of a Lesson</Name>
      <AssociatedModules>
        <Module Abbr="HIJ"/>
        <Module Abbr="ABC"/>
        <!-- there are 1..unbounded Modules per Lesson; only modules containing no modules are referenced -->
      </AssociatedModules>
      <!-- other information -->
    </Lesson>
  </Lessons>
</Dataset>
The task is now to create a list like this one: http://vlvz1.physik.hu-berlin.de/ss2012/physik/verzeichnis/en/, that is, the whole structure, but only with those Modules that actually have associated lessons.
The current query looks like this:
let $lang := data($ses/lang)
let $sem := data($ses/sem)
let $inst := data($ses/inst)
let $semxml := db:open("vlvz", concat($sem, '.xml'))
let $moduleswithlvs := distinct-values($semxml//AssociatedModules/Module/@Abbr)
return
<span>
  <div class="struc">
  {
    for $degree in $semxml//Institute[@Name=$ses//inst]/Degree[Modules//Module/@Abbr=$moduleswithlvs]
    return
    <div class="indent">
      <span class="degree">{data($degree/@Abbr)} {data($degree/@Name)}<br/></span>
      {
        for $module in $degree/Modules//Module[(* and */@Abbr=$moduleswithlvs) or @Abbr=$moduleswithlvs]
        let $leaf := not($module/*)
        let $depth := functx:depth-of-node($module) - 7
        return
        <div class="indent depth{$depth}">
          {data($module/@Abbr)} {data($module/@Name)} <br/>
          {
            if ($leaf) then
              <div class="indent">
              {
                for $lesson in vlvz:getlvs($semxml, data($module/@Abbr))
                return
                  (: note [1] :)
                  <div class="lesson"><span class="lessonid">{$lesson/@ID}</span><span class="lessonname">{$lesson/Name[@lang=$ses//lang]}</span><span class="lessonmodules">{string-join($lesson/AssociatedModules/Module/@Abbr, ', ')}</span></div>
              }
              </div>
            else ()
          }
        </div>
      }
    </div>
  }
  </div>
</span>
I already noticed that [1] is crucial: this node makes the query run about 10 times longer than returning an empty sequence. It makes no difference whether I just return <div></div> instead; that is as slow as with its content. I should also mention the function vlvz:getlvs:
declare function vlvz:getlvs($semxml as node()*, $modabbr as xs:string) as node()* {
  for $l in $semxml//Lesson
  where $l[AssociatedModules/Module/@Abbr=$modabbr]
  order by data($l/@ID)
  return $l
};
The queries are probably badly designed with respect to performance: basically everything I have done with XQuery so far was learning by doing.
Best regards from the capital, Ronny
On 03/29/2012 11:00 AM, Michael Seiferle wrote:
> @Ronny, maybe you could provide us with some sample code? In case it is not intended for the general public feel free to send it to support@basex.org.
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Hi again,
I found a big performance killer myself.
On 03/29/2012 12:18 PM, Ronny Möbius wrote:
> declare function vlvz:getlvs($semxml as node()*, $modabbr as xs:string) as node()* {
>   for $l in $semxml//Lesson
>   where $l[AssociatedModules/Module/@Abbr=$modabbr]
>   order by data($l/@ID)
>   return $l
> };
Replacing "$semxml//Lesson" by "$semxml/Dataset/Lessons/Lesson" saves about one third of the time spent.
Why is that? I thought, I don't have to care about the inefficiency of "//". Isn’t that handled by indices?
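For reference, the replacement described above could look roughly like the following. This is only a minimal sketch: the namespace URI bound to the vlvz prefix and the inline sample document are assumptions made so that the snippet runs on its own; in the real setup $semxml would come from db:open.

```xquery
(: hypothetical namespace URI; the real module's URI is not shown in this thread :)
declare namespace vlvz = "urn:example:vlvz";

(: same function as before, but with the explicit child path instead of "//" :)
declare function vlvz:getlvs($semxml as node()*, $modabbr as xs:string) as node()* {
  for $l in $semxml/Dataset/Lessons/Lesson
  where $l[AssociatedModules/Module/@Abbr = $modabbr]
  order by data($l/@ID)
  return $l
};

(: made-up sample data following the structure posted earlier :)
let $semxml := document {
  <Dataset>
    <Lessons>
      <Lesson ID="20"><AssociatedModules><Module Abbr="HIJ"/></AssociatedModules></Lesson>
      <Lesson ID="10"><AssociatedModules><Module Abbr="HIJ"/><Module Abbr="ABC"/></AssociatedModules></Lesson>
      <Lesson ID="30"><AssociatedModules><Module Abbr="ABC"/></AssociatedModules></Lesson>
    </Lessons>
  </Dataset>
}
(: iterate instead of using a path step, so the "order by" ordering is kept :)
for $l in vlvz:getlvs($semxml, "HIJ")
return string($l/@ID)
```

With the sample data this returns the IDs of the two lessons that reference HIJ, ordered by ID. The explicit path only works as long as every Lesson really sits at Dataset/Lessons/Lesson.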
All the best, Ronny
Dear Ronny,
I would dare to claim that BaseX provides one of the most advanced query compilers/optimizers for XQuery that you can find out there. Still, due to the complexity of the language, there are many cases in which queries can be further sped up by manually rewriting them. This is just what you did (and what we do for commercial customers).
In this particular case, we don't know at compile time which nodes will be bound to the $semxml variable. It could even be that the bound nodes refer to different databases, in which case we'd have to provide different path rewritings for the same original path.
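To illustrate why the bound nodes matter, here is a small self-contained sketch in which the same function body sees nodes from two differently structured documents. The helper local:lessons and both sample documents are made up for illustration and are not part of the original query.

```xquery
(: hypothetical helper: same parameter type as vlvz:getlvs, reduced to the path :)
declare function local:lessons($semxml as node()*) as element(Lesson)* {
  $semxml//Lesson
};

(: two made-up documents that keep their Lesson elements at different depths :)
let $a := document { <Dataset><Lessons><Lesson ID="1"/></Lessons></Dataset> }
let $b := document { <Archive><Year><Lessons><Lesson ID="2"/></Lessons></Year></Archive> }
return (
  count(local:lessons($a)),               (: 1 :)
  count(local:lessons(($a, $b))),         (: 2: "//" finds the Lesson in both documents :)
  count(($a, $b)/Dataset/Lessons/Lesson)  (: 1: the fixed path only matches the first document :)
)
```

Since the function accepts node()*, a fixed path such as /Dataset/Lessons/Lesson is only equivalent to //Lesson for some inputs, so a processor cannot substitute one for the other without knowing what $semxml will hold.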
Hope this helps,
Christian
On 30.03.2012 at 14:06, "Ronny Möbius" moebius@physik.hu-berlin.de wrote:
> Replacing "$semxml//Lesson" by "$semxml/Dataset/Lessons/Lesson" saves about one third of the time spent.
> Why is that? I thought I didn't have to care about the inefficiency of "//". Isn’t that handled by indices?