The inability to handle whitespace (and punctuation) appropriately results in false positives in our StratML-enabled query service at https://search.aboutthem.info/

I look forward to learning whether anything can be done about it.



On Wednesday, February 14, 2024 at 05:38:41 AM EST, Imsieke, Gerrit, le-tex <gerrit.imsieke@le-tex.de> wrote:


Whitespace is probably only a minor factor here. It can’t explain the loading times that grow non-linearly with document count.

Dietmar, have you looked at the memory consumption? My experience is that if memory gets scarce, garbage collection will kick in frequently, slowing down the import process. Increasing -Xmx in the startup script might improve the import speed. If your computer has 16 GB of RAM, try setting -Xmx12g, for example, and see whether there is an improvement. You can see the memory consumption in the GUI, so try to create the DB from the GUI.

Gerrit

On 14.02.2024 10:48, Christian Grün wrote:
> Thanks for the addition, Liam; I should have mentioned that.
>
> If your input has mixed content, and if the relevant sections have xml:space='preserve' attributes…
>
> <p xml:space='preserve'>The <em>very</em> <id>tc34q</id>.</p>
>
> …whitespace stripping will be safe.
>
> Similarly, it may be helpful to know that whitespace gets lost if XML strings…
>
> <p>The <em>very</em> <id>tc34q</id>.</p>
>
> …are evaluated as XQuery. To prevent that, you can add a statement to the prolog of the query:
>
> declare boundary-space preserve;
> <p>The <em>very</em> <id>tc34q</id>.</p>
>
> Whitespace handling is generally a tricky issue in XML.
>
> Best,
> Christian
>
>
> On Wed, Feb 14, 2024 at 10:38 AM Liam R. E. Quin <liam@fromoldbooks.org> wrote:
>
>    On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:
>>
>>    If your XML input has been properly indented to improve readability, you can reduce the size of your database by dropping superfluous whitespace during the import:
>>
>>    SET STRIPWS ON; CREATE DB ...
>>    db:create('db', '/path/to/documents', (), map { 'stripws': true() })
>
>    Beware that this is not schema-based and can remove whitespace-only text nodes in mixed content:
>    <p>The <em>very</em> <id>tc34q</id>.</p>
>    may become (as i understand it)
>          <p>The <em>very</em><id>tc34q</id>.</p>
>    (i have seen this, with different software, cause potentially catastrophic problems in aircraft manuals!)
>
>    liam
>
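
For anyone who wants to reproduce the mixed-content effect discussed in this thread without a BaseX installation, here is a minimal sketch in Python (standard library only). It is not BaseX's implementation of STRIPWS. It assumes that "stripping" means dropping whitespace-only text nodes unless xml:space='preserve' is in effect, and the helper names strip_ws and stripped are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the predefined XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def strip_ws(elem, preserve=False):
    """Drop whitespace-only text nodes below elem (hypothetical helper).

    Assumption: xml:space='preserve' on an element protects all text
    inside it, and xml:space='default' switches stripping back on.
    """
    space = elem.get(XML_SPACE)
    if space == "preserve":
        preserve = True
    elif space == "default":
        preserve = False

    # Text directly after the start tag of elem.
    if not preserve and elem.text is not None and not elem.text.strip():
        elem.text = None

    for child in elem:
        strip_ws(child, preserve)
        # child.tail is text inside elem, so elem's setting applies to it.
        if not preserve and child.tail is not None and not child.tail.strip():
            child.tail = None
    return elem


def stripped(xml):
    """Parse a string, strip whitespace-only text nodes, serialize again."""
    return ET.tostring(strip_ws(ET.fromstring(xml)), encoding="unicode")


plain = "<p>The <em>very</em> <id>tc34q</id>.</p>"
kept = "<p xml:space='preserve'>The <em>very</em> <id>tc34q</id>.</p>"

print(stripped(plain))  # <p>The <em>very</em><id>tc34q</id>.</p>
print(stripped(kept))   # the space between </em> and <id> survives
```

The first result shows the two words running together, which is the failure Liam warns about; the second shows that the xml:space attribute keeps the separating space. The non-schema-based approach cannot tell an indentation newline from a meaningful single space, so the attribute (or leaving stripping off) is the only protection in this sketch.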