We are using RESTXQ. I was hoping to speed things up with parallel processing :-).

We are using some new indices to speed things up, and more can be done. The main issue is that we process a lot of files and there are multiple levels of processing:

1- Apply 1st level

2- Save to db

3- Apply 2nd level

4- Save to db

5- Apply 3rd level

We work by level so that we can search content after it has been processed in a level, which means the indices need to be refreshed between levels. For each level, I apply everything I can before I need to re-index.

The levels look something like this (with some variations):

1- Add ids to all elements (content coming from authors through webdav doesn't always have all the required ids)

2- Aggregate content for a publication... That means resolving references recursively until all the pieces that create a larger publication are aggregated

3- Filter out content that doesn't apply to the current configuration (done after aggregation because we may use the same aggregate for multiple filter combinations. For example, we may have a publication for 2 similar products where the same content is used but a few lines here and there are different... Getting the same publication out for 2 different OS versions would be a good example. Same content, tiny differences here and there.)

4- Apply transformation to filtered aggregate (to one or more formats: HTML, PDF, CSV, RSS, all of them, or whatever is needed)
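
To make level 1 concrete, here is a minimal sketch of what "add ids to all elements" could look like as a non-updating function. It assumes the id attribute is simply called `id` and that a UUID from BaseX's Random Module (`random:uuid()`) is an acceptable id value; both are assumptions for illustration, not our actual rules.

```xquery
(: Sketch only: copies a tree and adds an @id to every element that lacks one.
   The attribute name 'id' and the use of random:uuid() are assumptions. :)
declare function local:add-ids($node as node()) as node() {
  typeswitch ($node)
    case document-node() return
      document { $node/node() ! local:add-ids(.) }
    case element() return
      element { node-name($node) } {
        (: keep existing attributes, then add an id if none is present :)
        $node/@*,
        if ($node/@id) then () else attribute id { random:uuid() },
        $node/node() ! local:add-ids(.)
      }
    (: text, comments and processing instructions are copied unchanged :)
    default return $node
};

(: usage: the first <p> keeps its id, <topic> and the second <p> get new ones :)
local:add-ids(<topic><p id="p1">kept</p><p>new id</p></topic>)
```

Because the function only builds a new tree, it can run in a read-only query; the result still has to be written back in a separate updating step.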

If I am outputting the publication in HTML and PDF for 26 of the 52 languages, I was hoping to be able to apply the filters and aggregates on the 26 db pairs (base + staging) at once. Maybe I need 26 instances of BaseX, where each instance holds one language... Then my .js could call each instance individually. That's a lot of ports... and again, not easy for clients to just add a language. If it gives us parallel processing, it may be worth it.

Then I'd need to figure out how to handle processes that use more than one instance of BaseX... like the translation processes. A lot of files would need to go outside of BaseX, through the .js. I might need a node.js layer. I can't imagine the .js client doing all the work... So far the client has been pretty light, and the control logic was split between the .js and the .xqm. I thought moving the lang loop outside of the .xqm would mean parallel processing, just because each call to the .xqm function would be separate, each with its own $lang. As you know, that didn't do it. Oops.
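
For comparison, this is a rough sketch of keeping the lang loop inside a single query and running the read-only part per language in parallel. It assumes the `xquery:fork-join` function from BaseX's XQuery Module, which evaluates the supplied non-updating functions in parallel. The language list and `local:aggregate` are placeholders, and whether this helps with locking or memory use for our volumes would need to be measured.

```xquery
(: Sketch only: the language list and local:aggregate are hypothetical. :)
declare variable $langs as xs:string* := ('es-us', 'fr-ca');

(: Placeholder for the real, non-updating aggregation of one language db.
   Returns a map of (path in db, document). :)
declare function local:aggregate($lang as xs:string) as map(*) {
  map:merge(
    for $doc in db:open($lang)
    return map:entry(db:path($doc), $doc)
  )
};

(: One function item per language; fork-join runs them in parallel
   and returns their results, here one map entry per language. :)
map:merge(
  xquery:fork-join(
    for $lang in $langs
    return function() { map:entry($lang, local:aggregate($lang)) }
  )
)
```

All results are held in memory until the query ends, so this would only suit batches that fit in memory.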

Optimizing performance is key for us at this point... so any clue is welcome.

The 2 most time-intensive processes are creating the aggregates and transforming files to XLIFF for translation. What these processes have in common... If I can stop holding the dbs while these run, I'm good.

I'm even considering writing all the small outputs to the file system and then importing the results back once the process is over. Most operations would become read-only as far as BaseX is concerned... not my favorite approach, but it might do the trick...
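
A minimal sketch of the read-only half of that idea, assuming the File Module functions `file:create-dir`, `file:parent` and `file:write`. The database name, the output directory and `local:process` are made up for illustration.

```xquery
(: Sketch only: 'es-us', the output directory and local:process are assumptions. :)
declare variable $lang as xs:string external := 'es-us';
declare variable $out-dir as xs:string external := '/tmp/staging-' || $lang || '/';

(: Placeholder for the real processing of one document (identity here). :)
declare function local:process($doc as document-node()) as document-node() {
  $doc
};

(: Read every document of the language db and write the processed result
   to the file system, mirroring the document's path in the database. :)
for $doc in db:open($lang)
let $target := $out-dir || db:path($doc)
return (
  file:create-dir(file:parent($target)),
  file:write($target, local:process($doc))
)
```

A second, updating query could then load the files back into the staging database, for example with `db:replace`, so the write lock is only held for the import itself.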

On Wed, Feb 6, 2019 at 9:24 AM Christian Grün <christian.gruen@gmail.com> wrote:

Hi France,

I agree that duplicating the same code more than once is not a good
idea. I surely know too little about your use case, as I guessed you
were sending custom query strings to BaseX via one of our APIs. Are
you using REST or RESTXQ?

It seems that your current update operation is pretty costly. Do you
think there are chances to speed it up?

Best,
Christian

On Wed, Feb 6, 2019 at 9:12 AM France Baril
<france.baril@architextus.com> wrote:
>
> Irsh, we have 52 languages and our whole system is based on being able to work with any language and letting clients add/remove languages without having to call developers. I can't imagine the domino effect of having to build a shell function per language per process that accesses the DB.
>
> Plus as we are running batch processes, I think we'll just run out of memory.
>
> I'm thinking one function like this per language is what you propose :
>
> declare
>   %updating
>   %rest:path('/base/filter-es-us')
> function local:filter-es-us() {
>   let $src-db := db:open('es-us')
>   (: $results is a map of (filename, xml) :)
>   let $results := local:apply-non-updating-processes($src-db)
>   for $path in map:keys($results)
>   return db:replace('staging-es-us', $path, $results($path))
> };
>
> declare function local:apply-non-updating-processes($src-db) {
>   map:merge(
>     for $file in $src-db/*
>     let $res := local:do-x($file)  (: "do x": placeholder for the actual processing :)
>     return map:entry(db:path($file), $res)
>   )
> };
>
>
> Since we run batch processes I'm also thinking we'll run out of memory with processes like that... or maybe we also need to split things into small functions so each tiny update is in its own function... then maintaining functions for 52 languages becomes even harder... or I add an extra layer of abstraction and build the .xqm functions dynamically based on a central code base and the dynamic language names... hmmm....
>
> I'm thinking out loud here trying to find my way outside of dynamic names... but static naming of databases doesn't sound like a good idea in our case. Dynamic naming is at the core of our approach... or maybe I'm so laced in it that I can't see the easy way in?
>
>
>
>
>
> On Mon, Feb 4, 2019 at 11:46 AM Christian Grün <christian.gruen@gmail.com> wrote:
>>
>> Hi France,
>>
>> > I noticed that the latest version of BaseX lost this feature and nothing seems to replace it. I'm trying to improve the performance of batch processes and I was counting on that feature a lot. Any chance it will come back or that something equivalent will come?
>>
>> With BaseX 9, we removed the classical GLOBALLOCK option (i.e.,
>> GLOBALLOCK = false is standard now).
>>
>> > get db:open($lang)/*
>> > process
>> > save to db:open('staging-' || $lang)
>>
>> The name of your database may be specified as static string in your
>> query (no matter if you use BaseX 8 or 9):
>>
>> get db:open('de')/*
>> process
>> save to db:open('staging-de')
>>
>> Did you try this already?
>> Christian
>
>
>
> --
> France Baril
> Architecte documentaire / Documentation architect
> france.baril@architextus.com