Re: [basex-talk] whitespace

25 Jun 2012


Hi Cerstin,
...
[…]
So my idea was to have the original Text-DB (without whitespace) and the new
Text-DB (with whitespace), let's call it Text-DB-WS. All nodes in Text-DB
have corresponding nodes in Text-DB-WS, they only differ concerning the
node-id.  So I should be able to detect which node-id of Text-DB corresponds
to which node-id of Text-DB-WS.  And then I could create a new version of
Collect-DB by replacing the value of all "node" elements with the respective
node-id from Text-DB-WS.
Could this be done using BaseX or should I rather do some Perl-scripting?

A straightforward solution could look as follows:
_________________________
declare option output:separator '\n';
declare variable $texts1 := db:open('Text-DB')//text();
declare variable $texts2 := db:open('Text-DB-WS')//text();
for $text1 in $texts1
let $str1 := normalize-space($text1)
let $id1 := db:node-id($text1)
return $id1 || ': ' || string-join(
  for $text2 in $texts2
  where $str1 = normalize-space($text2)
  return string(db:node-id($text2))
, ',')
_________________________
The query retrieves all text nodes of the two databases. In a nested
loop, all whitespace-normalized strings are compared against each other,
and the resulting output lists the ids of the text nodes of the first
database, each followed by the ids of the matching texts of the second
database:
3: 4,13
5: 7
7: 10
9: 4,13
If the databases are large, however, this approach may be too slow
due to its O(n²) runtime. In that case, XQuery maps or the "group by"
clause could probably be used to reduce the number of comparisons.
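As a rough sketch of the map and "group by" idea: the query below builds
a lookup map for Text-DB-WS once and then probes it for every text node
of Text-DB, instead of comparing all pairs of strings. It assumes a BaseX
version that supports the XQuery 3.1 functions map:merge and map:entry.
The $index variable is introduced here for illustration only, and
db:open and output:separator are simply carried over from the query
above, so they may need adjusting for other BaseX versions.
_________________________
declare option output:separator '\n';
declare variable $texts1 := db:open('Text-DB')//text();
declare variable $texts2 := db:open('Text-DB-WS')//text();
(: Build the index once: normalized string -> comma-separated node ids :)
let $index := map:merge(
  for $text2 in $texts2
  group by $key := normalize-space($text2)
  (: after grouping, $text2 holds all text nodes sharing the same key :)
  return map:entry($key, string-join($text2 ! string(db:node-id(.)), ','))
)
(: Probe the index with one lookup per text node of the first database :)
for $text1 in $texts1
return db:node-id($text1) || ': ' || $index(normalize-space($text1))
_________________________
For text nodes without a match, the lookup returns an empty sequence, so
only the id and the colon are printed, as in the nested-loop version.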
I hope this serves as a first inspiration,
Christian
