On March 10, 2021 at 3:44:28 PM, Martin Honnen (martin.honnen@gmx.de) wrote:
On 10.03.2021 21:37, Ron Katriel wrote:
> Hi,
>
> I would appreciate your advice on optimizing a query against a large
> BaseX (9.2.4) database. It is loaded with data from the FDA’s Adverse
> Event Reporting System (FAERS). Currently this is just the 2020 dataset
> which comprises 12 documents stored as 308,870,597 nodes (6,565 MB).
>
> The queries below effectively - though not necessarily efficiently -
> implement a histogram. The first, which is applied to patient gender
> (sex), returns the results (3 items) in 52 seconds:
>
> 2 893694
> 1 583999
> 0 198
>
> The second does this for patient weight - rounded to the closest 10 lbs
> increment. It takes 580 seconds to place the data into 67 bins.
> Initially I tried running it on the rounded weights but aborted the run
> as it was a taking an inordinate amount of time (there are 217 distinct
> weights in the dataset).
>
> Is there a way to improve the performance of this type of query?
>
> Thanks,
> Ron
>
>
> (: 3 items - 52 sec :)
> let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
> for $value in distinct-values($safetyreport/patient/patientsex)
> return concat($value, " ",
>   count(index-of($safetyreport/patient/patientsex, $value)))
You could always try whether grouping performs better, e.g.:
for $ps in db:open('FAERS')/ichicsr/safetyreport/patient/patientsex
group by $s := $ps
return $s || " " || count($ps)
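
If you want to see whether the grouped version is actually faster, BaseX's profiling module can time the evaluation directly (a sketch only; it assumes the same 'FAERS' database is open):

```xquery
(: prof:time() evaluates its argument and prints the elapsed time
   to standard error, returning the result unchanged. :)
prof:time(
  for $ps in db:open('FAERS')/ichicsr/safetyreport/patient/patientsex
  group by $s := $ps
  return $s || " " || count($ps)
)
```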
>
> (: 67 items - 580 sec :)
> let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
> for $value in distinct-values(
>   $safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10))
> return concat($value, " ",
>   count(index-of(
>     $safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10),
>     $value)))
for $pw in db:open('FAERS')/ichicsr/safetyreport/patient/patientweight
group by $w := $pw ! (. div 10.0) ! round(.) ! (. * 10)
return $w || " " || count($pw)
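
As a hedged variant (same database layout assumed), the rounding pipeline can be written as a single expression, and the bins can be ordered by frequency so the most common weight range comes first:

```xquery
for $pw in db:open('FAERS')/ichicsr/safetyreport/patient/patientweight
(: round($pw div 10) * 10 bins each weight to the nearest 10-lb increment :)
group by $w := round($pw div 10) * 10
(: after grouping, $pw holds all members of the group, so count($pw)
   is the bin size :)
order by count($pw) descending
return $w || " " || count($pw)
```

Whether either version beats `distinct-values` plus `index-of` depends on the optimizer; the grouped form at least avoids rescanning the full sequence once per distinct value.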