Dear BaseX people,
is this a bug?
basex "<foo>{namespace {''}{'bar'}}</foo>"
=> [XQDY0102] Duplicate namespace declaration: ''.
This works as expected:
basex "<foo>{namespace {'xxx'}{'bar'}}</foo>"
=> <foo xmlns:xxx="bar"/>
With kind regards,
Hans-Jürgen
I’ve worked out how to optimize my process that indexes DITA topics based on which top-level maps they are ultimately used from. It turned out I needed to first index the maps in reference-count order, from least to most, which meant I could then just look up the top-level maps used by any direct-reference maps that reference a given topic; with that in place, each topic only requires a single index lookup.
However, on my laptop these lookups still take about 0.1 seconds per topic, so for thousands of topics it's a long time (relatively speaking).
But the topic index process is 100% parallelizable, so I would be able to have at least 2 or 3 ingestion threads going on my 4-CPU server machine.
Note that my ingestion process is two-phased:
Phase 1: Construct an XQuery map with the index details for the input topics (the topics already exist in the database, only the index is new).
Phase 2: Persist the map to the database as XML elements.
I do the map construction both to take advantage of map:merge() and because it's the only way I can index the DITA maps and topics in one transaction: build the doc-to-root-map entries for the DITA maps, use that data to build the doc-to-root-map entries for all the topics, then persist the lot to the database for future use. This is in the context of a one-time mass load of content from a new git work tree; subsequent changes to the content database will be to individual files, and the index can be updated incrementally.
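In rough outline, the two phases look something like this; the database name, element names, and the local:root-maps-for-doc() helper are placeholders rather than my actual code:

(: Phase 1: build the in-memory index with map:merge() :)
let $doc-to-root-maps as map(*) :=
  map:merge(
    for $doc in db:open('dita-content')
    (: local:root-maps-for-doc(): placeholder for the lookup that returns
       the top-level map URIs a document is ultimately used from :)
    return map:entry(document-uri($doc), local:root-maps-for-doc($doc))
  )
(: Phase 2: persist the map as XML elements for later lookups :)
return db:add('dita-content',
  <doc-to-root-maps>{
    map:for-each($doc-to-root-maps, function($uri, $root-maps) {
      <entry doc="{$uri}">{ $root-maps ! <root-map uri="{.}"/> }</entry>
    })
  }</doc-to-root-maps>,
  '.index/doc-to-root-maps.xml')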
So I’m just trying to optimize the startup time so that it doesn’t take two hours to load and index our typical content set.
I can also try to optimize the low-level operations, although they're pretty simple, so I don't see much opportunity for significant improvement; I also haven't had time to try different options and measure them.
I must also say how useful the built-in unit testing framework is—that’s really made this work easier.
Cheers,
Eliot
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
servicenow.com<https://www.servicenow.com>
LinkedIn<https://www.linkedin.com/company/servicenow> | Twitter<https://twitter.com/servicenow> | YouTube<https://www.youtube.com/user/servicenowinc> | Facebook<https://www.facebook.com/servicenow>
I’m making good progress on our BaseX-based validation dashboard application.
The basic process here is we use Oxygen’s scripting support to do DITA map validation and then ingest the result into a database (along with the content documents that were validated) and provide reports on the validation result.
The practical challenge here seems to be running the Oxygen process successfully from BaseX: because our content is so large, it can take tens of minutes for the Oxygen process to run.
I set the command timeout to be much longer than the process should run, but when running it from the HTTP app's query panel it eventually failed with an error that wasn't a timeout (my code had earlier reported legitimate errors, so I know errors are properly reported).
As soon as the Oxygen process ends I want to ingest the resulting XML file, which is why I started with doing it from within BaseX.
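For reference, the shape of the call is roughly this; the paths, database name, and validate.sh wrapper script are placeholders, and the timeout value is just an example:

let $result := proc:execute(
  '/opt/oxygen/scripts/validate.sh',            (: placeholder wrapper around the Oxygen scripting CLI :)
  ('--input', '/data/dita/main.ditamap'),
  map { 'dir': '/data/dita', 'timeout': 3600 }  (: seconds; well beyond the expected run time :)
)
return
  if ($result/code = '0')
  then db:add('validation-reports',
         doc('/data/dita/validation-report.xml'),
         'reports/latest-report.xml')
  else error(xs:QName('local:oxygen-failed'), string($result/error))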
But I’m wondering if this is a bad idea and whether I should really be doing it with, e.g., a shell script run via cron or some such?
I was trying to keep everything in BaseX as much as possible just to keep it simple.
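If I stay inside BaseX, the other option I can see is the Jobs Module, i.e. scheduling the ingestion query instead of driving it from cron; the embedded query and the times below are just placeholders:

jobs:eval(
  "db:add('validation-reports',
          doc('/data/dita/validation-report.xml'),
          'reports/latest-report.xml')",
  map { },
  map { 'start': '02:00:00', 'interval': 'P1D' }
)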
Any general tips or cautions for this kind of integration of BaseX with the outside world?
Thanks,
E.
I guess you’re right.
We haven’t revised EXPath packaging for a long time now. Actually, I’m
not sure how many people use it at all ;) Anyone out there?
On Mon, Jan 24, 2022 at 4:08 PM Eliot Kimber
<eliot.kimber(a)servicenow.com> wrote:
>
> I did confirm that if the package @name URI matches the URI of a module, then the module is resolved, i.e.:
>
> <package xmlns="http://expath.org/ns/pkg"
> name="http://servicenow.com/xquery/module/now-dita-utils"
> abbrev="now-xquery"
> version="0.1" spec="1.0">
> <title>XQuery modules for ServiceNow Product Content processing and support</title>
> <xquery>
> <namespace>http://servicenow.com/xquery/module/database-from-git</namespace>
> <file>database-from-git.xqm</file>
> </xquery>
>
> …
>
> </package>
>
> That implies that each module needs to be in a separate XAR file in order to also be in a separate namespace.
>
> I don’t think that is consistent with the EXPath packaging spec.
>
> In the description of the XQuery entry it says:
>
> An XQuery library module is referenced by its namespace URI. Thus the xquery element associates a namespace URI to an XQuery file. An importing module just need to use an import statement of the form import module namespace xx = "<namespace-uri>";.
> An XQuery main module is associated a public URI. Usually an XQuery package will provide functions through library modules, but in some cases one can want to provide main modules as well.
> This implies to me that the value of the <namespace> element in the <xquery> is what should be used to resolve the package reference, not the package’s @name value, which simply serves to identify the package.
>
> Is my analysis correct or have I misunderstood the package mechanism?
>
> Cheers,
> E.
I have large maps that include nodes as entry values. I want to persist these maps in the DB for subsequent retrieval.
I believe the best/only strategy is:
1. Construct a new map where each node in a value is replaced by its node-id() value.
2. Serialize the map to JSON and store as a blob
To reconstitute it, do the reverse (parse the JSON back into a map and replace the node IDs with nodes), roughly as sketched below.
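Something like this round trip is what I have in mind; the 'dita-content' database name is a placeholder, db:node-id()/db:open-id() are the BaseX-specific functions I'd use, and it assumes the map keys are plain strings:

declare function local:freeze($index as map(*)) as xs:string {
  serialize(
    map:merge(
      map:for-each($index, function($key, $nodes) {
        (: store database node ids instead of the nodes themselves :)
        map:entry($key, array { db:node-id($nodes) })
      })
    ),
    map { 'method': 'json' }
  )
};

declare function local:thaw($json as xs:string) as map(*) {
  map:merge(
    map:for-each(parse-json($json), function($key, $ids) {
      (: JSON numbers come back as xs:double, so cast before resolving :)
      map:entry($key, db:open-id('dita-content', array:flatten($ids) ! xs:integer(.)))
    })
  )
};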
Is my analysis correct? Have I missed any detail or better approach?
Thanks,
Eliot
I’m packaging some XQuery modules into an EXPath XAR archive and then installing them using the repo install command.
The command succeeds and my package is listed by repo list, but I'm unable to get the modules to import, and I can't see where I've gone wrong. I must have made some non-obvious (to me) error, but so far I have not found it.
I've checked my expath-pkg.xml against the working examples I have (functx and Schematron BaseX) and I don't see any difference. I've also carefully checked all my namespace URIs to make sure they match.
Here is my expath-pkg.xml:
<package xmlns="http://expath.org/ns/pkg"
name="http://servicenow.com/xquery"
abbrev="now-xquery"
version="0.1"
spec="1.0">
<title>XQuery modules for ServiceNow Product Content processing and support</title>
<xquery>
<namespace>http://servicenow.com/xquery/module/database-from-git</namespace>
<file>database-from-git.xqm</file>
</xquery>
<xquery>
<namespace>http://servicenow.com/xquery/module/now-dita-utils</namespace>
<file>now-dita-utils.xqm</file>
</xquery>
<xquery>
<namespace>http://servicenow.com/xquery/module/now-relpath-utils</namespace>
<file>now-relpath-utils.xqm</file>
</xquery>
</package>
And here is the structure of the resulting module directory after installation:
main % ls ~/apps/basex/repo/http-servicenow.com-xquery-0.1
content expath-pkg.xml
main % ls ~/apps/basex/repo/http-servicenow.com-xquery-0.1/content
database-from-git.xqm now-dita-utils.xqm now-relpath-utils.xqm
main %
And the prolog for now-relpath-utils.xqm:
module namespace relpath="http://servicenow.com/xquery/module/now-relpath-utils";
Trying to import in a trivial XQuery script, e.g.:
import module namespace relpath="http://servicenow.com/xquery/module/now-relpath-utils";
count(/*)
Produces this error:
Error:
Stopped at /Users/eliot.kimber/git/dita-build-tools/src/main/xquery/file, 1/88:
[XQST0059] Module not found: http://servicenow.com/xquery/module/now-relpath-utils.
So clearly something isn’t hooked up correctly but I don’t see any obvious breakage or violation of a required convention.
Any idea where I’ve gone wrong or tips on debugging the resolution failure?
This is 9.6.4 on macOS.
Thanks,
Eliot
[BaseX 9.6.4 on Mac through BaseX GUI]
I’m trying to automate syncing a database with content from a git repo that is locally cloned.
My approach is to use proc:execute() to pull the repo and then use git diff-tree to see which files changed:
let $pullResult := proc:execute($gitCmd,('pull', '-v'), map{'dir' : $repoPath})
let $changeList := proc:execute($gitCmd, ('diff-tree', '--no-commit-id', '--name-status', 'HEAD'), map{'dir' : $repoPath})
However, the result returned for “diff-tree” appears to get truncated.
Here’s a typical result, where I’m echoing out the $pullResult and $changeList values created above:
Pull result:
<result>
<output>Already up to date.
</output>
<error>From code.devsnc.com:doc/dita-now
= [up to date] master -> origin/master
= [up to date] dita-now -> origin/dita-now
= [up to date] rtprn -> origin/rtprn
= [up to date] sanfransokyo -> origin/sanfransokyo
= [up to date] scodefreeze -> origin/scodefreeze
= [up to date] scratc/table_issues_fix -> origin/scratc/table_issues_fix
= [up to date] scratch/fix_canvas_issue -> origin/scratch/fix_canvas_issue
= [up to date] scratch/newDitavals2022 -> origin/scratch/newDitavals2022
= [up to date] scratch/simplifyDitavals -> origin/scratch/simplifyDitavals
= [up to date] scratch/table_name_issue -> origin/scratch/table_name_issue
</error>
<code>0</code>
</result>
Change list:
<result>
<output>M doc
</output>
<code>0</code>
</result>
Note that the pull response looks as expected but the change list response is just “M\tdoc” where “doc” is the first directory in what should be a path to a file.
Here’s the same result from the command line:
dita-now % git diff-tree --no-commit-id --name-status -r HEAD
M doc/source/product/rpa-studio/task/use-datareader-queryexcel.dita
M doc/source/product/rpa-studio/task/use-datetime-add.dita
M doc/source/product/rpa-studio/task/use-datetime-compare.dita
…
I can’t see anything I’m doing wrong or options to the execute() or system() functions that would affect the result.
Any idea what might be causing this or how I could work around it?
Thanks,
Eliot
Hi all,
While auditing the library dependencies of an application, I found that
org.basex:basex-api [1] depends on a very old version of
com.ettrema:milton-api -- v1.8.1.4, released in 2014 [2]. The package seems
to have been migrated to the io.milton namespace and has a more recent
release, v3.1.0.301 (December 2021) [3]. This stood out to me because my
application is using the latest version of Apache commons-io, v2.11.0,
while the old version of milton-api depends on Apache commons-io v1.4,
released in 2008.
Would it be feasible to migrate BaseX to a newer version of the milton-api
library? Thanks in advance!
Matt
[1] https://mvnrepository.com/artifact/org.basex/basex-api/9.6.4
[2] https://mvnrepository.com/artifact/com.ettrema/milton-api/1.8.1.4
[3] https://mvnrepository.com/artifact/io.milton/milton-api/3.1.0.301
In the context of my 40K topic DITA corpus, I’m trying to build a “where used” report that finds, for each topic, the other topics that directly refer to the topic. I can do this by looking for the target topic’s filename in the values of @href attributes in other topics (I’m taking advantage of a local rule we have where all topic filenames should be unique).
My current naive approach is simply:
$topics//*[tokenize(@href, '/') = $filename]
Where $topics is the 40K topics.
Based on profiling, the use of tokenize() is slightly faster than either matches() or contains(), but all forms take about 0.5 seconds per target topic, which is way too slow to be practical.
So I’m trying to work out what my performance optimization strategies are in BaseX.
In MarkLogic I would set up an index so I could do fast lookup of tokens in @href values or something similar (it’s been long enough since I had to optimize MarkLogic queries that I don’t remember the details but basically indexes for everything).
I know I could do a one-time construction of the where-used table and then use that for quick lookup for subsequent queries but I’m trying to find a solution that is more appropriate for my current “create a new database with the latest files from git and run some queries quickly to get a report” mode.
I suspect that using full-text indexing may be a solution here but wondering what other performance optimization options I have for this kind of look up.
Thinking about it now, I definitely need to see whether building the where-used table would actually be slower. That is, find every @href, resolve it, and construct a map from topics to the href elements that point to them (something like the sketch below). Hmm.
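Something like this is what I have in mind for the inverted approach; the database name and the sample filename are placeholders, and it ignores details like fragment identifiers on @href:

let $where-used as map(*) :=
  map:merge(
    for $ref in db:open('dita-content')//*[@href]
    let $filename := tokenize($ref/@href, '/')[last()]
    group by $filename
    return map:entry($filename, $ref)
  )
(: one map lookup per topic instead of scanning all 40K topics :)
return $where-used('some-topic-filename.dita')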
Anyway, any guidance on this challenge would be appreciated.
Cheers,
Eliot
Hi,
recently I ran into serious (as in SERIOUS) performance trouble regarding expensive regexes, and no wonder why.
Here is my scenario:
* XML text documents with a total of 1m text nodes
* A regex search string, built from a huge string dictionary of 50,000 strings (some of them containing 50 words each)
* a match must be wrapped in an element (= create a link)
I could drink many cups of tea while waiting for the regex to complete... hours... I ran out of tea!
Then I found the 2021 Balisage paper from Mary Holstege titled "Fast Bulk String Matching" [1], in which she explores the Aho-Corasick algorithm and implements it in XQuery - marvellous! Following this, I can build a data structure that gives me fast lookups, but building that structure is still very slow due to the amount of text to build it from. So this was not fast enough for my use case - or I may simply not be smart enough to apply it correctly :-|
So, I tried tinkering with maps which turned out to give me extraordinary performance gains:
* build a map from the string dictionary
* walk through all text nodes one by one
* for each text node, put every contiguous combination of words from the text node, in word order (I need to find strings, not single words), into another map
* strip punctuation (for word boundaries) and do some normalization of whitespaces in both maps
* compare the keys of both maps
* give the new reduced string dictionary to the regular regex search
While comparing the maps, I do not know where in the text my strings are, but at least I know if they are in there - to find out where exactly (and how do they fit my other regex context) I can then use a massively reduced regular regex search. Fast!
I am quite happy BUT I still cannot understand why this is so much faster for my sample data:
* plain method : 51323.72ms
* maps method: 597.94ms(!)
Is this due to any optimizations done by BaseX or have I simply discovered the power of maps?
How would you do it? Is there still room for improvement?
Why does Aho-Corasick not help much with this scenario? Is it because the list of strings is simply too massive?
Why is this so much faster with text-splitting-to-map?
See below for the query examples to better understand what I am trying to do (bulk data not included) [2],[3]
There is no normalization of punctuation in the examples, but that is only needed for completeness.
Best, Daniel
[1] http://www.balisage.net/Proceedings/vol26/print/Holstege01/BalisageVol26-Ho…
[2] plain method
let $textnodes := fetch:xml(file:base-dir()||file:dir-separator()||'example.xml')//text()
let $strings := file:read-text(file:base-dir()||file:dir-separator()||'strings.txt')
let $regex := $strings
for $textnode in $textnodes
where matches($textnode, $regex)
return $textnode
[3] maps method
(:~
: Create map from string
: Tokenization
:
: @param $string
: @return Map
:)
declare function local:createMapFromString (
$string as xs:string
) as map(*) {
let $map_words :=
map:merge(
for $token in tokenize($string, '\|')
return map:entry($token, $token),
map { 'duplicates': 'use-first' })
return
$map_words
};
(:~
: Create map from text nodes
: Write any combination of words in document order to the map
:
: @param $textnodes
: @return Map
:)
declare function local:createMapFromTextnodes (
$textnodes as xs:string+
) as map(*) {
map:merge(
for $node in $textnodes
let $text := normalize-space($node)
let $tokens := tokenize($text,' ')
let $map_nodes :=
map:merge(
for $start in 1 to fn:count($tokens)
for $length in 1 to fn:count($tokens) - $start + 1
return
map:entry(fn:string-join(fn:subsequence($tokens, $start, $length), ' '),'x')
)
return
$map_nodes)
};
(:~
: Compare two maps
:
: @param $map1
: @param $map2
: @return xs:string*
:)
declare function local:reduceMaps (
$map1 as map(*),
$map2 as map(*)
) as xs:string* {
for $key in map:keys($map1)
where map:contains($map2,$key)
let $value := map:get($map1,$key)
return $value
};
let $textnodes := fetch:xml(file:base-dir()||file:dir-separator()||'example.xml')//text()
let $strings := file:read-text(file:base-dir()||file:dir-separator()||'strings.txt')
let $map_words := local:createMapFromString($strings)
let $map_textnodes := local:createMapFromTextnodes($textnodes)
let $matches := local:reduceMaps($map_words,$map_textnodes)
let $regex := string-join(for $match in $matches
group by $match
return $match,'|')
for $textnode in $textnodes
where matches($textnode, $regex)
return $textnode