Let's suppose you've got a map like this (and hope that in just typing it into the email I haven't left in any really horrible typos!):
    let $drugInfo as map(xs:string, element()) := map:merge(
      for $element in collection('newDrugInfo')/descendant::infoElement
      let $name as xs:string := (: whatever you do to extract the official drug name from the update data :)
      return map:entry($name, $element)
    )
then in the other docbase you've got:

    let $updatePlaces as map(xs:string, element()+) := map:merge(
      for $place in collection('updating-this-one')/descendant::couldBeInteresting
      let $drugName as xs:string? := (: whatever you're doing now to match the drug name; the assumption is that you expect to find at most one :)
      where exists($drugName)  (: because you might not have one! :)
      group by $drugName
      (: BaseX will magically make $place a sequence of all the $place values with this drug name, effectively a sequence of pointers to those element nodes :)
      return map:entry($drugName, $place)
    )

(Note $drugName is declared as xs:string? rather than xs:string, since the where clause only makes sense if it can be empty.)
So now you can:

    for $drug in map:keys($drugInfo)  (: we're iterating through the official list :)
    let $needsUpdate as element()* := $updatePlaces($drug)  (: element()* rather than element()+, in case a drug has no places to update :)
    for $place in $needsUpdate  (: iterate through our sequence of pointers :)
    return (: do whatever you're doing to insert the information in $drugInfo($drug) :)
It looks like the same old n-squared inner-loop/outer-loop update process, but I have found that it doesn't perform like that. I almost never update the docbase, so whatever magic is involved may go away when you do, but I've found this "map both sides" pattern very useful when merging data.
-- Graydon
On Sun, Sep 2, 2018 at 9:25 PM Ron Katriel rkatriel@mdsol.com wrote:
Hi Graydon,
Thanks for the suggestion. Could you provide sample code to help with this? If needed I can share the relevant BaseX snippet.
Best, Ron
On Sep 2, 2018, at 9:16 PM, Graydon Saunders graydonish@gmail.com wrote:
Maps that reference nodes are pointers, rather than copies. It sounds like you could map every drug name to every "interesting" XML node that contains it using grouping during map creation and then just iterate on the keys to process the nodes.
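A tiny self-contained illustration of that point (the document and names here are invented purely for the example): the map stores a reference to the node, so the value it returns is the very same node, not a copy.

```xquery
let $doc := document { <drugs><drug name="aspirin"/></drugs> }
let $m := map { 'aspirin' : $doc//drug[@name = 'aspirin'] }
(: "is" compares node identity; this is true because the map holds a pointer, not a copy :)
return $m('aspirin') is $doc//drug[@name = 'aspirin']
```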
On Sun, Sep 2, 2018 at 4:52 PM Ron Katriel rkatriel@mdsol.com wrote:
Hi Christian,
As promised here is a summary of my experimentation. I replaced the expensive join with a map lookup and the program finished in 4 minutes vs. 1 hour using a naive loop over the two databases (the original 6 hours reported were due to overly aggressive virus scanning software, which I turned off for this benchmarking).
The downside of not using “contains text” inside the double loop (due to its slowness) is that I had to tokenize the CT.gov interventions and remove stopwords prior to looking them up in the DrugBank map. This is a subpar solution, as some drugs are missed (looking up all the possible word combinations would be expensive).
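For what it's worth, a sketch of that tokenize-and-look-up step; $drugMap, $stopwords, and the element names are assumptions for illustration, not the actual code:

```xquery
(: assumed: $drugMap as map(xs:string, element()) built from DrugBank,
   $stopwords as xs:string*, and intervention_name elements in the CT.gov data :)
for $intervention in collection('ctgov')/descendant::intervention_name
let $tokens := tokenize(lower-case($intervention), '\s+')[not(. = $stopwords)]
for $token in $tokens
where map:contains($drugMap, $token)
return $drugMap($token)
```

As noted above, single-token lookups like this miss multi-word drug names.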
It would be nice if there were a way to combine the matching flexibility of the “contains text” construct (with its myriad options) with the efficiency of a map lookup, but that may require a finite-state automaton such as the Aho–Corasick algorithm. If you are aware of any existing solutions, I would appreciate your sharing them.
Thanks, Ron
On August 4, 2018 at 8:47:49 PM, Ron Katriel (rkatriel@mdsol.com) wrote:
Hi Christian,
Thanks for the advice. The BaseX engine is phenomenal, so I quickly realized that the problem was performing a naive cross product.
Since this query is run only once a month (to serialize XML to CSV) and applied to new data (DB) each time, a BaseX map will likely be the most straightforward solution (I used the same idea for another project with good results).
I will not be able to implement and test this for another couple of weeks but will summarize my findings to the group as soon as possible.
Best, Ron
On Aug 4, 2018, at 6:00 AM, Christian Grün christian.gruen@gmail.com wrote:
Hi Ron,
I believe the slow execution may be due to a combinatorial issue: the cross product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not counting synonyms).
Yes, this sounds like a pretty expensive operation. Having maps (XQuery, Java) will be much faster indeed.
As Gerrit suggested, and if you will run your query more than once, it would definitely be another interesting option to build an auxiliary, custom "index database" that allows you to do exact searches (this database may still have references to your original data sets). Since version 9 of BaseX, volatile hash maps will be created for looped string comparisons. See the following example:
    let $values1 := (1 to 500000) ! string()
    let $values2 := (500001 to 1000000) ! string()
    return $values1[. = $values2]
Algorithmically, 500,000 * 500,000 string comparisons would need to be performed, resulting in a total of 250 billion operations (and no results). The runtime is much faster than you might expect (and, as far as I can judge, much faster than in any other XQuery processor).
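For processors without this optimization, the same check can be rewritten with an explicit map, which is roughly the shape of what such an optimizer does internally (a sketch, not BaseX's actual implementation):

```xquery
let $values1 := (1 to 500000) ! string()
let $values2 := (500001 to 1000000) ! string()
(: build a hash map over the right-hand values once, then probe it per item :)
let $lookup := map:merge($values2 ! map:entry(., true()))
return $values1[map:contains($lookup, .)]
```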
Best, Christian