Hi Cerstin,
[…] So my idea was to have the original Text-DB (without whitespace) and the new Text-DB (with whitespace), let's call it Text-DB-WS. All nodes in Text-DB have corresponding nodes in Text-DB-WS; they differ only in their node-id. So I should be able to detect which node-id of Text-DB corresponds to which node-id of Text-DB-WS. And then I could create a new version of Collect-DB by replacing the value of all "node" elements with the respective node-id from Text-DB-WS.
Could this be done using BaseX or should I rather do some Perl-scripting?
A straightforward solution could look as follows:
_________________________
declare option output:separator '\n';
declare variable $texts1 := db:open('Text-DB')//text();
declare variable $texts2 := db:open('Text-DB-WS')//text();

for $text1 in $texts1
let $str1 := normalize-space($text1)
let $id1 := db:node-id($text1)
return $id1 || ': ' || string-join(
  for $text2 in $texts2
  where $str1 = normalize-space($text2)
  return string(db:node-id($text2))
, ',')
_________________________
The query retrieves all text nodes of the two databases. In a nested loop, every whitespace-normalized string from the first database is compared against every string from the second. Each output line lists the id of a text node in the first database, followed by the ids of all matching text nodes in the second database:
3: 4,13
5: 7
7: 10
9: 4,13
If the database is too large, however, this approach may be too slow due to its O(n²) runtime. In that case, XQuery maps or the "group by" statement could probably be used to reduce the number of comparisons.
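As a rough, untested sketch of the "group by" and map idea: the variant below builds a lookup map from each normalized string to the node ids in Text-DB-WS once, so every text of Text-DB needs only a single map lookup instead of a full scan. It assumes a BaseX release that implements the XQuery 3.1 functions map:merge and map:entry; the variable names $index, $str and $ids are only illustrative, and the separator option is taken over unchanged from the query above.

```xquery
declare option output:separator '\n';
declare variable $texts1 := db:open('Text-DB')//text();
declare variable $texts2 := db:open('Text-DB-WS')//text();

(: Build the index once: normalized string -> node ids in Text-DB-WS.
   After "group by", $text2 is bound to all text nodes of one group. :)
let $index := map:merge(
  for $text2 in $texts2
  group by $str := normalize-space($text2)
  return map:entry($str, $text2 ! db:node-id(.))
)
(: One map lookup per text node of Text-DB replaces the inner loop :)
for $text1 in $texts1
let $ids := $index(normalize-space($text1))
return db:node-id($text1) || ': ' || string-join($ids ! string(.), ',')
```

The output should have the same shape as above. Strings that occur several times still map to several ids (such as "4,13"), so those cases remain ambiguous and would need a further criterion to be resolved.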
I hope this serves as a first inspiration,
Christian