Re: [basex-talk] performance of preceding/following axis

18 Jun 2015


Hello Daniel,
I don't have much time right now, but maybe a few pointers to get you started. I didn't test any of this, so take it with a grain of salt.
However, I suspect your subsequence solution does not perform optimally, because each call to subsequence() probably creates a new sequence. With two calls per match, 50,000 matches mean 100,000 new sequences, which is fairly costly. Instead, I would recommend using position() to compare the element positions and select your window that way. This can operate directly on your data.
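Here is an untested sketch of one way to read the position() suggestion: put the position test in a predicate on the axis step itself instead of wrapping the axis in subsequence(). On a reverse axis such as preceding::, position() counts backwards from the context node, so the predicate selects the nearest tokens. By contrast, subsequence(preceding::w, 1, 3) works on the result in document order and therefore takes the first three w elements of the document. The variable $window and the conc/pre/match/next wrapper are taken from your query; whether this is actually faster on your data is an assumption I have not verified.

```xquery
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Window size on each side of the match, as in the original query :)
declare variable $window := 3;

for $m in //w[@type = "NN"]
(: On the reverse axis, position() = 1 is the closest preceding token :)
let $pre  := $m/preceding::w[position() <= $window]
(: On the forward axis, position() = 1 is the closest following token :)
let $next := $m/following::w[position() <= $window]
where ($pre, $next)/@type = "ADJA"
return
  <conc>
    <pre>{$pre}</pre>
    <match>{$m}</match>
    <next>{$next}</next>
  </conc>
```

Both $pre and $next come back in document order, so the output reads left to right as in the text.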
Also, did you know that XQuery 3.0 has a window expression (see http://www.w3.org/TR/xquery-30/#id-windows for more)? This looks like an ideal use case for it, and it should also perform much better than subsequences.
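An untested sketch of how a sliding window could express this, assuming the candidate noun sits in the middle of a window of 2 * $window + 1 tokens. The names $window, $s, $e and $toks are only illustrative, and the conc/pre/match/next wrapper is copied from your query. One known limitation of this form: the first $window tokens of the document never end up in the middle of a window, so matches at the very beginning are missed.

```xquery
declare default element namespace "http://www.tei-c.org/ns/1.0";

declare variable $window := 3;

(: One window starts at every token and spans up to 2 * $window + 1 tokens :)
for sliding window $toks in //w
  start at $s when true()
  end at $e when $e - $s eq 2 * $window
(: The token in the middle of the window is the candidate match :)
let $m := $toks[$window + 1]
where $m/@type = "NN" and ($toks except $m)/@type = "ADJA"
return
  <conc>
    <pre>{$toks[position() <= $window]}</pre>
    <match>{$m}</match>
    <next>{$toks[position() > $window + 1]}</next>
  </conc>
```

Because the end condition is not marked "only end", shorter windows at the end of the token sequence are still returned, so nouns near the end of the document are kept with a truncated following context.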
Hope this helps,
Dirk
On 06/18/2015 12:39 PM, Schopper, Daniel wrote:
...
Hi,
I'm trying to use BaseX for linguistic queries on a TEI document containing annotated tokens (i.e. tei:w elements with attributes). I'm specifically interested in distance queries that search for combinations of token features within a given window (e.g. all nouns that have an adjective ending in 'lein' within a distance of 3). In theory, this is rather easy to formulate as an XPath or XQuery expression, but performance is poor once the dataset gets a bit larger (in my case, the test document has a total of 190,000 tokens, with attribute and text indexes created).
This is essentially what I am trying to do, as a simple XPath:
declare default element namespace  "http://www.tei-c.org/ns/1.0";
//w[@type = "NN"][(subsequence(preceding::w, 1, 3), subsequence(following::w, 1, 3))/@type = "ADJA"]
Since tokens may be interwoven with markup, I have to use preceding::* or following::*.
A simple XQuery returning all matches including their context would look like this:
declare default element namespace  "http://www.tei-c.org/ns/1.0";
let $window := 3
let $matches := //w[@type = "NN"]
return
for $m in $matches
let $pre := subsequence($m/preceding-sibling::w, 1, $window)
let $next := subsequence($m/following-sibling::w, 1, $window)
return
  if (($pre,$next)[@type = "ADJA"])
  then
    <conc>
      <pre>{$pre}</pre>
      <match>{$m}</match>
      <next>{$next}</next>
    </conc>
  else ()
With $matches being a sequence of ca. 50,000 elements, I fear a FLWOR expression is a bit too costly; limiting $matches to about 1,000 items runs in about 5600 ms (returning 154 items), but performance degrades rapidly beyond that (not to mention with a larger distance).
So, my question is: Is there a way to improve performance for operations like these (without changing the input document)?
I'd be glad to provide my dataset off list if that helps.
Thanks,
Daniel
-- 
Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
| Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22
