Hey Tim -

On Fri, Oct 4, 2024 at 5:53 PM Thompson, Timothy <timothy.thompson@yale.edu> wrote:

Thanks, Bridger! `file:write-text-lines` seems to be the issue. For example, this query doesn’t run in parallel.

 

You're right - apologies for missing this key point in your initial email. 

Is this expected behavior?

declare variable $PATH := "";

xquery:fork-join(
  for $_ in (1 to 8)
  return fn() {
    file:write-text-lines(
      $PATH||$_||".json",
      for $i in (1 to 1000000)
      return
        serialize(
          <fn:map>
            <fn:string key="n">{$i}</fn:string>
          </fn:map>,
          {"method": "json", "escape-solidus": "no", "json": {
            "format": "basic", "indent": "no"
          }}
        )
    )
  },
  { "parallel": "8"}
)

 

It does seem that the writes in `file:write-text-lines` are *not* running in parallel: they take about as long as a sequential use of the same function.
I did the following comparison.

Using your example:
ls -l --time-style=full-iso /tmp/fork-test
total 130860
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:39:57.926518544 +0000 1.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:02.849576119 +0000 2.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:07.799634010 +0000 3.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:28.652877890 +0000 4.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:12.892693574 +0000 5.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:18.140754950 +0000 6.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:23.569818443 +0000 7.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:39.098000046 +0000 8.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:33.779937851 +0000 9.json

vs

using a sequential write:
declare variable $PATH := "/tmp/fork-test/sequential/";

for $i in (1 to 9)
return
  file:write-text-lines(
    $PATH || $i || ".json",
    for $n in (1 to 1000000)
    return
      serialize(
        <fn:map>
          <fn:string key="n">{$n}</fn:string>
        </fn:map>,
        { "method": "json", "escape-solidus": "no",
          "json": { "format": "basic", "indent": "no" }
        }
      )
  )

ls -l --time-style=full-iso /tmp/fork-test/sequential
total 130860
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:19.841259435 +0000 1.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:24.820319704 +0000 2.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:29.838380446 +0000 3.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:35.041443427 +0000 4.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:40.182505657 +0000 5.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:45.305567669 +0000 6.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:50.535630977 +0000 7.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:55.703693534 +0000 8.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:50:00.948757024 +0000 9.json

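Going by the gaps between modification times, each file in both attempts takes about 5 seconds to write, and about 41 seconds separate the first and last file in both runs. The only difference is that the files are not finished in numeric order in the fork-join example. I wonder if it's due to the appending in `file:write-text-lines`?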
Maybe Christian can chime in and let us know :)
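
Modification times only show when each file was finished, so a more direct check for overlap might be to record start and end offsets inside each forked function. The following is a rough sketch, not a verified benchmark: the `$DIR` scratch directory, the `.txt` payload, and the task count of 4 are placeholders, and it assumes the sequence inside each function is evaluated in the order written.

```xquery
(: Hypothetical scratch directory, not from the thread; adjust before running. :)
declare variable $DIR := "/tmp/fork-test/timing/";

file:create-dir($DIR),
let $t0 := prof:current-ms()
return xquery:fork-join(
  for $i in 1 to 4
  return fn() {
    (: offset at which this task begins, relative to the start of the query :)
    let $start := prof:current-ms() - $t0
    return (
      file:write-text-lines(
        $DIR || $i || ".txt",
        for $n in 1 to 1000000
        return string($n)
      ),
      (: second timestamp is taken after the write has returned :)
      "task " || $i || ": start " || $start || " ms, end "
        || (prof:current-ms() - $t0) || " ms"
    )
  },
  map { "parallel": "4" }
)
```

If the tasks really overlap, all start offsets should be close to zero. If each start offset roughly matches the previous task's end offset, the tasks are effectively running one after another.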

Have a nice weekend!
Best,
Bridger

 

-- 
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research

Interim Manager, Metadata Services Unit

www.linkedin.com/in/timathompson

 

 

From: Bridger Dyson-Smith <bdysonsmith@gmail.com>
Date: Wednesday, October 2, 2024 at 1:05 PM
To: Thompson, Timothy <timothy.thompson@yale.edu>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Write files in parallel?

hi Tim - hope you are well.

In the past I have used something like the following for web requests. I don't remember whether it was perfectly parallel, but it was "parallel enough":

xquery:fork-join(
  for $xml in ('calq.xqm','factbook.xml','filesystem.xml','locations.xml','wiki1.zip', 'wiki2.zip','xmark.xml')
  let $url := 'https://files.basex.org/xml/'
  return fn() {
    file:write(
      '/tmp/fork-test/' || $xml,
      http:send-request(
        <http:request method='get'/>,
        $url || $xml
      )
    )
  },
  map { 'parallel': '3'}
)

Hopefully that's helpful (and apologies to the BaseX team's file server)!

Best,

Bridger

 

) ls -l --time-style=full-iso
total 11640
-rw-r--r-- 1 bridger bridger    1593 2024-10-02 17:02:51.321251082 +0000 calq.xqm
-rw-r--r-- 1 bridger bridger 1763070 2024-10-02 17:02:52.301261520 +0000 factbook.xml
-rw-r--r-- 1 bridger bridger 2770290 2024-10-02 17:02:53.331272491 +0000 filesystem.xml
-rw-r--r-- 1 bridger bridger 1566322 2024-10-02 17:02:52.497263608 +0000 locations.xml
-rw-r--r-- 1 bridger bridger  512686 2024-10-02 17:02:52.670265451 +0000 wiki1.zip
-rw-r--r-- 1 bridger bridger 5133340 2024-10-02 17:02:54.046280106 +0000 wiki2.zip
-rw-r--r-- 1 bridger bridger  155448 2024-10-02 17:02:52.859267464 +0000 xmark.xml

 

 

On Tue, Oct 1, 2024 at 5:32 PM Thompson, Timothy <timothy.thompson@yale.edu> wrote:

Hello,

 

Is it possible to call file:write-text-lines in parallel inside a fork-join operation? I have multiple databases that I would like to run a query over, in parallel, and write the results as JSON Lines to a file per database. When I try this, it doesn’t seem to parallelize.
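
For illustration only, the kind of query described above might look roughly like the sketch below. It is not the actual query: the database names, the `$OUT` directory, and the per-document JSON object are placeholders, and `db:get` assumes a recent BaseX version (older releases used `db:open`).

```xquery
(: Placeholder database names and output directory; not from the original message. :)
declare variable $DBS := ("db1", "db2", "db3");
declare variable $OUT := "/tmp/jsonl/";

file:create-dir($OUT),
xquery:fork-join(
  for $db in $DBS
  return fn() {
    file:write-text-lines(
      $OUT || $db || ".jsonl",
      (: one JSON object per line; here just the path of each document :)
      for $doc in db:get($db)
      return serialize(
        map { "db": $db, "path": db:path($doc) },
        map { "method": "json", "indent": "no" }
      )
    )
  }
)
```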

 

Thanks in advance,

Tim

 

 

-- 
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research

Interim Manager, Metadata Services Unit

Yale University Library

www.linkedin.com/in/timathompson