Thank you Ron and Christian for the suggestions! I’ll give both a try. Will the map solution cause BaseX to hold every item from both databases in memory?
Best, Matt
On Wed, Jun 4, 2025 at 3:27 AM Christian Grün christian.gruen@gmail.com wrote:
Hi Matt,
Ron has already given you a perfect answer on how to utilize maps. Another slightly slower solution is to address the database twice, and let the compiler do its job:
let $originalDB := db:get('OriginalDB')/data/item
let $updateDB := db:get('UpdateDB')/data/item
for $id in distinct-values(($originalDB/@id, $updateDB/@id))
let $original := db:get('OriginalDB')/data/item[@id = $id]
let $update := db:get('UpdateDB')/data/item[@id = $id]
return json:serialize(map {
  'id': xs:long($id),
  'original': $original,
  'update': $update
})
This may seem counterintuitive at first glance, but the expression "db:get('OriginalDB')/data/item[@id = $id]" is something that the optimizer will rewrite for index access. It is equivalent to:
let $original := db:attribute("OriginalDB", $id)/self::attribute(id)/parent::item
let $update := db:attribute("UpdateDB", $id)/self::attribute(id)/parent::item
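For completeness, a map-based variant along the lines Ron described might look something like the sketch below. It builds one XQuery 3.1 map per database, keyed on the string value of @id, so each database is scanned only once (the variable names are just illustrative):

```xquery
let $originals := map:merge(
  db:get('OriginalDB')/data/item ! map:entry(string(@id), .)
)
let $updates := map:merge(
  db:get('UpdateDB')/data/item ! map:entry(string(@id), .)
)
for $id in distinct-values((map:keys($originals), map:keys($updates)))
return json:serialize(map {
  'id': xs:long($id),
  'original': $originals($id),
  'update': $updates($id)
})
```

Lookups against the two maps are effectively constant-time, which avoids the repeated sequential scans of the original query.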
Hope this helps, Christian
On Tue, Jun 3, 2025 at 11:10 PM Matt Dziuban mrdziuban@gmail.com wrote:
Hi all,
I'm working on a query that reads from two databases -- OriginalDB and UpdateDB. The structure of each database is the same:
<data>
  <item id="1">...</item>
  <item id="2">...</item>
  ...
</data>
OriginalDB has 2,305,570 items and UpdateDB has 307,019.
The query collects all distinct item ids from the two databases, then for each id gets the matching item (if there is one) from each database and returns the id and both nodes in a JSON object. Here's the query:
let $originalDB := db:get('OriginalDB')/data/item
let $updateDB := db:get('UpdateDB')/data/item
for $id in distinct-values(($originalDB/@id, $updateDB/@id))
let $original := $originalDB[@id = $id]
let $update := $updateDB[@id = $id]
return json:serialize(map {
  'id': xs:long($id),
  'original': $original,
  'update': $update
}, map { 'indent': 'no' })
In its current state this query is very slow -- it has returned only 35 JSON objects in ~30 minutes. How can I optimize it to best take advantage of each database's indexes?
Thanks in advance, Matt