I am sure someone more knowledgable about BaseX specifics will answer, too, but in general, you have to try not to step on the optimizer's feet.

1. don't call db:open() twice; assign it to a variable,

let $sourceData as document-node()+ := db:open('FAERS')

and then use the variable.

2. don't do math in the binding expression of a for clause!  Also, don't do math twice.  Stick things in variables.

3. grouping is more efficient than distinct-values

So maybe

for $weight in $sourceData/ichicsr/safetyreport/patient/patientweight
   let $numeric as xs:integer := (($weight div 10.0) => round()) * 10
   group by $numeric
  return concat($numeric, count($weight))

Not tested; I just typed that in, and I might not have understood what you were trying to do.  But take advantage of the tuple stream processing whenever you can.

-- Graydon

On Wed, Mar 10, 2021 at 3:37 PM Ron Katriel <rkatriel@mdsol.com> wrote:
Hi,

I would appreciate your advice on optimizing a query against a large BaseX (9.2.4) database. It is loaded with data from the FDA’s Adverse Event Reporting System (FAERS). Currently this is just the 2020 dataset which comprises 12 documents stored as 308,870,597 nodes (6,565 MB).

The queries below effectively - though not necessarily efficiently - implement a histogram. The first, which is applied to patient gender (sex), returns the results (3 items) in 52 seconds:

2 893694
1 583999
0 198

The second does this for patient weight - rounded to the closest 10 lbs increment. It takes 580 seconds to place the data into 67 bins. Initially I tried running it on the rounded weights but aborted the run as it was a taking an inordinate amount of time (there are 217 distinct weights in the dataset).

Is there a way to improve the performance of this type of query?

Thanks,
Ron


(: 3 items - 52 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values($safetyreport/patient/patientsex)
return concat($value, " ", count(index-of($safetyreport/patient/patientsex, $value)))

(: 67 items - 580 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values($safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10))
return concat($value, " ", count(index-of($safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10), $value)))




The information in this email and any attachments are intended solely for the recipient(s) to whom it is addressed, and may be confidential and/or privileged. Any unauthorized distribution or copying of this transmittal or its attachments is prohibited. If you are not a named recipient or have received this email in error: (i) you should not read, disclose, or copy it, (ii) please notify the sender of your receipt by reply email and delete this email and all attachments.