Hi,
I am trying to join two large XML files (~490 MB and ~40 MB), which are the result of automatically converting old SQL dumps into XML. I created one database for each file. The query I wrote to join them appears to be correct, since it works when I limit the join to just a few items, but it never finishes if I apply it to all items:
Here is the XQuery: https://git.informatik.uni-leipzig.de/celano/perseus_morpheus/-/blob/master/join_files.xq
Here is the first file: https://git.informatik.uni-leipzig.de/celano/perseus_morpheus/-/blob/master/hib_parses.xml
Here is the second file: https://git.informatik.uni-leipzig.de/celano/perseus_morpheus/-/blob/master/hib_lemmas.xml
I have also tried to use the database module functions, but without success. Am I missing anything here? Thanks.
Ciao, Giuseppe
On 11.07.2020 at 14:41, Giuseppe G. A. Celano wrote:
here is the xquery: https://git.informatik.uni-leipzig.de/celano/perseus_morpheus/-/blob/master/...
Isn't the `where $nn` kind of meaningless? I don't think $nn can ever be an empty sequence, as you don't use `allowing empty` when you bind that variable in the nested `for`.
No idea of course whether that changes the problem you encounter.
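The point about `allowing empty` can be illustrated with a minimal sketch. The sample elements and the outer variable `$n` are hypothetical and not taken from the original query; only `$nn` is the name used in the thread. It assumes an XQuery 3.0 or later processor such as BaseX.

```xquery
(: Hypothetical sample data, not taken from the thread. :)
let $items  := (<a id="1"/>, <a id="2"/>)
let $lookup := (<b ref="1"/>)
return (
  (: Plain nested "for": id 2 has no match, so no tuple is produced for it,
     and $nn is never bound to the empty sequence. :)
  for $n in $items
  for $nn in $lookup[@ref = $n/@id]
  return <inner id="{ $n/@id }"/>,

  (: With "allowing empty", id 2 still yields one tuple in which $nn is (). :)
  for $n in $items
  for $nn allowing empty in $lookup[@ref = $n/@id]
  return <outer id="{ $n/@id }" matched="{ exists($nn) }"/>
)
```

Under these assumptions the first loop returns a single `inner` element for id 1, while the second returns an `outer` element for both ids, with `matched="false"` for id 2. A `where $nn` filter would therefore only make a difference in the second form.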
It is the remnant of a previous version of the script, but it does not affect the query, as far as I have seen. It is deleted now.
On Jul 11, 2020, at 3:05 PM, Martin Honnen martin.honnen@gmx.de wrote:
Isn't the `where $nn` kind of meaningless? I don't think $nn can ever be an empty sequence, as you don't use `allowing empty` when you bind that variable in the nested `for`.
No idea of course whether that changes the problem you encounter.
On 11.07.2020 14:41, Giuseppe G. A. Celano wrote:
here is the xquery: https://git.informatik.uni-leipzig.de/celano/perseus_morpheus/-/blob/master/... here is the first file:
Saxon EE seems to be capable of handling it (loading the files with the doc function, of course, instead of from a database), although it needs more than 2 GB of memory.
I am not sure where BaseX struggles; I am sure someone from the BaseX team can tell you soon.
Saxon HE also struggles, so it must be some of the advanced join optimizations in EE that allow it to run that query in a reasonable time.
One more solution that should be evaluated faster (the data to be looked up is directly stored in a map):
declare variable $hib_parses := db:open('hib_parses');
declare variable $hib_lemmas := db:open('hib_lemmas');

let $lemmas := map:merge(
  for $row in $hib_lemmas//row
  where $row/field[@name = 'lemma_lang_id'] = '3'
  return map:entry($row/field[@name = 'lemma_id'], $row),
  map { 'duplicates': 'combine' }
)
for $parse in $hib_parses//row
for $lemma in $lemmas($parse/field[@name = 'lemma_id'])
return (# db:copynode false #) {
  element wf {
    <f>{ $parse/* }</f>,
    <l>{ $lemma/* }</l>
  }
}
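To see the map-based lookup in isolation, here is a small self-contained sketch that runs without any database. The `row` and `field` structure and the `lemma_id` field follow the query above; the `lemma_text` and `form` fields and all values are invented for illustration. It assumes an XQuery 3.1 processor in which the `map` prefix is predeclared, as it is in BaseX.

```xquery
(: Hypothetical miniature of the two tables; all values are invented. :)
let $lemma_rows := (
  <row><field name="lemma_id">1</field><field name="lemma_text">a</field></row>,
  <row><field name="lemma_id">1</field><field name="lemma_text">b</field></row>,
  <row><field name="lemma_id">2</field><field name="lemma_text">c</field></row>
)
let $parse_rows := (
  <row><field name="lemma_id">1</field><field name="form">x</field></row>,
  <row><field name="lemma_id">3</field><field name="form">y</field></row>
)
(: Build the lookup once; 'combine' keeps every row that shares a key. :)
let $lemmas := map:merge(
  for $row in $lemma_rows
  return map:entry(string($row/field[@name = 'lemma_id']), $row),
  map { 'duplicates': 'combine' }
)
for $parse in $parse_rows
(: One map lookup per parse row instead of a scan over all lemma rows. :)
for $lemma in $lemmas(string($parse/field[@name = 'lemma_id']))
return
  <wf form="{ $parse/field[@name = 'form'] }"
      lemma="{ $lemma/field[@name = 'lemma_text'] }"/>
```

With this sample data, the parse row with `lemma_id` 1 is paired with both lemma rows that share that key, and the parse row with `lemma_id` 3 produces nothing because the lookup returns the empty sequence. The map is built in a single pass over the smaller input, which is presumably why this variant scales better than a nested scan.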
Hi Christian,
Thank you so much for your quick answer! Both scripts you gave work efficiently on my end! I had actually forgotten about pragmas; I tried to force the use of indexes by specifying data()/text nodes, but that did not work. I did remember that maps can "do the trick", so I first converted the XML into JSON and then tried to merge the files, but that did not work either. If it is of interest to you, I uploaded the files and query here:
script: https://git.informatik.uni-leipzig.de/celano/perseus_morpheus/-/blob/master/join_json_files.xq
1st file: https://git.informatik.uni-leipzig.de/celano/perseus_morpheus/-/blob/master/hib_parses.json
2nd file: https://git.informatik.uni-leipzig.de/celano/perseus_morpheus/-/blob/master/hib_lemmas.json
Thanks again for your help!
Ciao, Giuseppe
On 12. Jul 2020, at 15:46, Christian Grün christian.gruen@gmail.com wrote:
One more solution that should be evaluated faster (the data to be looked up is directly stored in a map):
declare variable $hib_parses := db:open('hib_parses');
declare variable $hib_lemmas := db:open('hib_lemmas');

let $lemmas := map:merge(
  for $row in $hib_lemmas//row
  where $row/field[@name = 'lemma_lang_id'] = '3'
  return map:entry($row/field[@name = 'lemma_id'], $row),
  map { 'duplicates': 'combine' }
)
for $parse in $hib_parses//row
for $lemma in $lemmas($parse/field[@name = 'lemma_id'])
return (# db:copynode false #) {
  element wf {
    <f>{ $parse/* }</f>,
    <l>{ $lemma/* }</l>
  }
}
basex-talk@mailman.uni-konstanz.de