Christian,
Thanks for sharing that. I assumed all along that this happens automatically. Anyway, I ran my query (for one drug, to save time) and see the following in the Info view
- apply text index for "Lenalidomide"
I believe the slow execution may be due to a combinatorial issue: the cross product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not counting synonyms).
I am considering an algorithmic solution that involves storing the DrugBank information in a hash table (map) and looking it up while iterating through the CT.gov (http://clinicaltrials.gov) trials.
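A minimal sketch of that idea, written in Python rather than XQuery and using made-up sample data. It assumes exact matching on lower-cased word sequences, which only approximates the full-text "contains text" semantics (no stemming, no index); the function names and the D1/T1 identifiers are hypothetical. Each trial is scanned once and every candidate phrase is a constant-time lookup, instead of running one full-text query per drug synonym.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lower-case and split into word tokens (diacritics are kept as-is)."""
    return re.findall(r"\w+", text.lower())

def build_drug_index(drugs):
    """Map each name or synonym (as a tuple of tokens) to the drug ids using it."""
    index = defaultdict(set)
    for drug_id, names in drugs.items():
        for name in names:
            tokens = tuple(normalize(name))
            if tokens:
                index[tokens].add(drug_id)
    return index

def match_trials(trials, index):
    """Yield (trial_id, drug_id) pairs found by looking up token n-grams."""
    max_len = max((len(key) for key in index), default=0)
    for trial_id, interventions in trials.items():
        found = set()
        for text in interventions:
            tokens = normalize(text)
            for start in range(len(tokens)):
                longest = min(max_len, len(tokens) - start)
                for size in range(1, longest + 1):
                    found |= index.get(tuple(tokens[start:start + size]), set())
        for drug_id in sorted(found):
            yield trial_id, drug_id

# Hypothetical sample data standing in for DrugBank and CT.gov records.
drugs = {
    "D1": ["Lenalidomide", "Revlimid"],
    "D2": ["Acetylsalicylic acid", "Aspirin"],
}
trials = {
    "T1": ["Lenalidomide 25 mg oral capsule"],
    "T2": ["Low-dose acetylsalicylic acid"],
    "T3": ["Placebo"],
}
pairs = list(match_trials(trials, build_drug_index(drugs)))
```

The work per trial depends on the length of its intervention text and on the longest synonym, not on the number of drugs, which is what removes the cross product.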
Best, Ron
On August 3, 2018 at 5:49:30 PM, Christian Grün (christian.gruen@gmail.com) wrote:
Our documentation should help you here: http://docs.basex.org/wiki/Indexes
Ron Katriel rkatriel@mdsol.com wrote on Fri, Aug 3, 2018, 23:20:
Hi Christian,
Yes, I created a full-text index when the databases were loaded (see the commands below). I also verified that FTINDEX is true for both databases (in the GUI under Database > Open & Manage).
How do I ensure that my query is rewritten for index access?
Thanks, Ron
SET FTINDEX true;
SET TOKENINDEX true;
CREATE DB CTGov "/Data Sets/ct.gov/xml"

SET FTINDEX true;
SET TOKENINDEX true;
SET STRIPNS true;
CREATE DB DrugBank "/Data Sets/DrugBank/drugbank.xml"
On August 3, 2018 at 4:12:43 PM, Christian Grün (christian.gruen@gmail.com) wrote:
Hi Ron,
Did you a) create a full-text index for your data and b) ensure that your query is rewritten for index access?
Best, Christian
On Fri, Aug 3, 2018 at 2:39 PM Ron Katriel rkatriel@mdsol.com wrote:
Christian,
Adding "diacritics sensitive" slows execution by a factor of 3. My script (fragment below), which joins two large databases, namely CT.gov and DrugBank, takes 2 hours without the "diacritics sensitive" constraint but 6 hours with it. Given the combinatorics involved, I am wondering if there is a better way to do this in BaseX.
Thanks, Ron
for $drug in db:open('DrugBank')/drugbank/drug
let $drug_name := $drug/name/text()
let $drug_synonyms :=
  functx:value-union(normalize-space(lower-case($drug/name)), local:drug-synonyms($drug_name))
for $synonym_name in $drug_synonyms
...
for $study in
  db:open('CTGov')/clinical_study[intervention/intervention_name contains text { $synonym_name } using case insensitive using diacritics sensitive]
...
Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions 350 Hudson Street, 7th Floor, New York, NY 10014 rkatriel@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598
| main: +1 212 918 1800
On August 1, 2018 at 12:41:26 PM, Ron Katriel (rkatriel@mdsol.com) wrote:
Thanks, Christian. Strangely, before contacting you I tried, on a hunch, adding the missing "using" keyword but still got the syntax error. Anyway, everything is good now!
Best, Ron
On August 1, 2018 at 3:57:51 AM, Christian Grün (christian.gruen@gmail.com) wrote:
I have fixed the example in the doc. Best, Christian
On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel rkatriel@mdsol.com wrote:
Hi,
The following example from your website (docs.basex.org/wiki/Full-Text) appears to be syntactically incorrect:
"'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
In the BaseX GUI the keyword diacritics is underlined in red and the following error is reported:
Unexpected end of query: 'diacritic sens...'.
This happens in version 8.6.4 and also the latest (9.0.2).
Thanks, Ron