Hello Daniel,
I don't have much time right now, but maybe a few pointers to get you started. I didn't test any of this, so take it with a grain of salt.
However, I suspect your subsequence solution is not performing optimally, because I would guess that a new sequence really is created on every call. So for 50.000 matches you have to create 100.000 new sequences, which is rather costly. Instead, I would recommend using position() to compare the element positions and get your window that way. This can operate directly on your data.
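A minimal, untested sketch of the position() idea, assuming an XQuery 3.0 processor (for the "where" clause placed between other clauses). The names $tokens and $i are made up for illustration; everything else is taken from the query quoted below. The tokens are collected once and the window is taken as a positional slice around each match:

```xquery
declare default element namespace "http://www.tei-c.org/ns/1.0";

let $window := 3
(: All tokens in document order, collected once (hypothetical name). :)
let $tokens := //w
for $m at $i in $tokens
where $m/@type = "NN"
(: Positional slices around index $i; positions below 1 simply match nothing. :)
let $pre  := $tokens[position() = ($i - $window) to ($i - 1)]
let $next := $tokens[position() = ($i + 1) to ($i + $window)]
where ($pre, $next)/@type = "ADJA"
return
  <conc>
    <pre>{$pre}</pre>
    <match>{$m}</match>
    <next>{$next}</next>
  </conc>
```

Whether this is actually faster depends on how the processor evaluates positional predicates on a sequence, so it needs to be measured. One thing to be aware of: these slices are the tokens nearest to $m, whereas subsequence($m/preceding::w, 1, 3) works on the sequence in document order and therefore returns the first three w elements of the document, not the three closest to the match.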
Also, did you know that there is a window expression in XQuery 3 (see http://www.w3.org/TR/xquery-30/#id-windows for more)? This looks like an ideal use case for it, and it should also perform much better than subsequences.
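A rough, untested sketch of how a sliding window could be used here, assuming an XQuery 3.0 processor. The window itself only covers the match and the tokens that follow it, so the preceding context is taken by position from the same token sequence. The names $tokens, $s and $e are made up for illustration:

```xquery
declare default element namespace "http://www.tei-c.org/ns/1.0";

let $window := 3
let $tokens := //w
(: One window per noun: it starts at the noun and ends $window tokens later,
   or at the end of the token sequence if fewer tokens remain. :)
for sliding window $w in $tokens
  start $m at $s when $m/@type = "NN"
  end at $e when $e - $s eq $window
(: Preceding context by position, relative to the window start $s. :)
let $pre  := $tokens[position() = ($s - $window) to ($s - 1)]
(: The window without its first item, i.e. the tokens after the match. :)
let $next := tail($w)
where ($pre, $next)/@type = "ADJA"
return
  <conc>
    <pre>{$pre}</pre>
    <match>{$m}</match>
    <next>{$next}</next>
  </conc>
```

Because the end condition is not marked "only end", a noun close to the end of the document still yields a (shorter) window.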
Hope this helps, Dirk
On 06/18/2015 12:39 PM, Schopper, Daniel wrote:
Hi, I'm trying to use BaseX for linguistic queries on a TEI document containing annotated tokens (i.e. tei:w elements with attributes). I'm specifically interested in distance queries that search for combinations of token features within a given window (e.g. all nouns that have an adjective ending with 'lein' within a distance of 3). In theory, this is rather easy to formulate as an XPath or XQuery expression, but performance is poor when the dataset gets a bit larger (in my case, I have a total of 190.000 tokens in my test document, with attribute and text indexes created).
This is what I essentially try to do as a simple XPath:
declare default element namespace "http://www.tei-c.org/ns/1.0";
//w[@type = "NN"][(subsequence(preceding::w, 1, 3), subsequence(following::w, 1, 3))/@type = "ADJA"]
Since tokens may be interwoven with markup, I have to use preceding::* or following::*
A simple XQuery returning all matches including their context would look like this:
declare default element namespace "http://www.tei-c.org/ns/1.0";
let $window := 3
let $matches := //w[@type = "NN"]
return
  for $m in $matches
  let $pre := subsequence($m/preceding-sibling::w, 1, $window)
  let $next := subsequence($m/following-sibling::w, 1, $window)
  return
    if (($pre,$next)[@type = "ADJA"]) then
      <conc>
        <pre>{$pre}</pre>
        <match>{$m}</match>
        <next>{$next}</next>
      </conc>
    else ()
With $matches being a sequence of ca. 50.000 elements, a FLWOR is a bit too costly, I fear; limiting $matches to ~1.000 items runs in about 5600 ms (returning 154 items), but performance degrades rapidly beyond that (not to mention with a larger distance). So, my question is: Is there a way to improve performance for operations like these (without resorting to changing the input document)?
I'd be glad to provide my dataset off list, if this helps.
Thanks, Daniel