Hi again,

I had a quick look into the monitoring code, and I noticed two things:

1. It looks to me (correct me if I’m wrong) as if the code of the project was initially written for Saxon and then ported to BaseX. If you are interested in using BaseX, you could focus on the slow functions, try alternative formulations and (if you want to run the code with both processors in the future) ensure that Saxon still delivers good performance.

2. Some functions can be noticeably sped up (for both BaseX and Saxon) if you use XQuery 3.1 features such as maps or group by. For example, the runtime of #131014 could possibly be reduced with something similar to…

  for $ms in $Monitoring/*:MonitoringSite
  let $emsc := $ms/*:euMonitoringSiteCode
  for $ceqm in $ms/*:ChemicalEcologicalQuantitativeMonitoring
  let $V_rech := $ceqm/*:parameterCode || '/' || $ceqm/*:parameterOther || '/' || $ceqm/*:chemicalMatrix
  group by $group := $emsc || ': ' || $V_rech
  where count($ceqm) > 1
  return $V_rech
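
A map-based variant of the same duplicate check could look roughly like the sketch below. The sample data is made up so that the query runs on its own, and the names $index and $key are only illustrative; the element names are taken from the snippet above. The map is built once and could be reused by other checks that need the same lookup.

```xquery
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical sample data, only here to make the sketch runnable. :)
let $Monitoring :=
  <Monitoring>
    <MonitoringSite>
      <euMonitoringSiteCode>FR001</euMonitoringSiteCode>
      <ChemicalEcologicalQuantitativeMonitoring>
        <parameterCode>P1</parameterCode>
        <chemicalMatrix>W</chemicalMatrix>
      </ChemicalEcologicalQuantitativeMonitoring>
      <ChemicalEcologicalQuantitativeMonitoring>
        <parameterCode>P1</parameterCode>
        <chemicalMatrix>W</chemicalMatrix>
      </ChemicalEcologicalQuantitativeMonitoring>
      <ChemicalEcologicalQuantitativeMonitoring>
        <parameterCode>P2</parameterCode>
        <chemicalMatrix>W</chemicalMatrix>
      </ChemicalEcologicalQuantitativeMonitoring>
    </MonitoringSite>
  </Monitoring>

(: Build the lookup once: each key maps to all elements that share it. :)
let $index := map:merge(
  for $ms in $Monitoring/*:MonitoringSite
  for $ceqm in $ms/*:ChemicalEcologicalQuantitativeMonitoring
  let $key := $ms/*:euMonitoringSiteCode || ': ' || $ceqm/*:parameterCode
    || '/' || $ceqm/*:parameterOther || '/' || $ceqm/*:chemicalMatrix
  return map:entry($key, $ceqm),
  map { 'duplicates': 'combine' }
)

(: Report every key that occurs more than once. :)
for $key in map:keys($index)
where count($index($key)) > 1
return $key
```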

If BaseX turns out to be the way to go, it’s definitely worth taking advantage of the database aspect. In BaseX, databases are fairly light-weight, which means you can simply create them before running the queries (e.g., with a single 'CREATE DB poc /path/to/poc_rapportage_controle-main/xml' command) and use db:get('poc', 'your-doc.xml') in the queries to access a document (or even stick with doc('your-doc.xml') if you enable DEFAULTDB [1]).
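
As a rough sketch, a query that reads from the database instead of the file system might start like this. It assumes that the 'poc' database has already been created as described above and that it contains a document named your-doc.xml; the variable name $Monitoring and the count are only there for illustration.

```xquery
(: Assumes the database 'poc' exists and contains 'your-doc.xml'. :)
declare variable $Monitoring := db:get('poc', 'your-doc.xml')/*;

(: Illustrative check: number of monitoring sites in the document. :)
count($Monitoring/*:MonitoringSite)
```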

Hope this helps,
Christian



On Mon, Apr 22, 2024 at 9:32 AM Christian Grün <christian.gruen@gmail.com> wrote:
Hi Antonio,

As Liam indicated, you may get better performance when adding your documents to a database.

In general, though, the runtimes of BaseX and Saxon have aligned pretty much over the years, and I assume there’ll be a trivial reason behind the drastic difference in the runtime.

Your test setup is probably too complex for us readers to spend more time with it. Could you possibly share a more basic example with us, ideally with a single document and query file?

Thanks in advance,
Christian



On Mon, Apr 22, 2024 at 8:54 AM ANDRADE Antonio <antonio.andrade@ofb.gouv.fr> wrote:

@Liam R. E. Quin: Thanks for your feedback. The processing time is between 2 minutes and more than 11 hours (see table below), so the startup time of the Java virtual machine has little impact. The main XQuery script loads the XML document once at the start of processing. The document is then queried several times as part of more or less complex quality controls. At the moment, the XML document is not intended to be stored, which is why it is not loaded into a database before processing.

 

 

                             Saxon                             BaseX
                             Start     Stop      Elapsed time  Start     Stop      Elapsed time
Check Monitoring 2022 FRH    06:16:54  06:19:30  00:02:36      06:44:06  10:05:21  03:21:15
Check Multi schéma 2022 FRH  06:25:46  06:31:47  00:06:01      10:05:55  11:39:07  01:33:12

 

 

From: Liam R. E. Quin <liam@fromoldbooks.org>
Sent: Saturday, April 20, 2024 05:00
To: ANDRADE Antonio <antonio.andrade@ofb.gouv.fr>; basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Performance issue with BaseX CLI

 

On Fri, 2024-04-19 at 10:45 +0200, ANDRADE Antonio wrote:

Hi,

 

For the purposes of European Water Framework Directive reporting, I compared the performances of the Saxon and BaseX XQuery engines.

 

First, you should consider (as I think Martin said) the Java runtime startup time, typically a second or so.

 

Second, BaseX is a database. If you will process the same document many times, first load it into a database and then use the Python BaseX client. This will avoid startup time, and, more importantly, will allow BaseX to make use of database indexes.

 

If you will only process any given document once, then Saxon may well be the appropriate tool.

 

liam

 

 

-- 

Available for XML/Document/Information Architecture/XSLT/

XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.

Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org