I’m measuring the specific db:token() lookup on its own in order to isolate it from the effects of other processing.

These are page view records per document, covering several different published versions of each document, so for a given path you would expect at most three or four results rather than thousands.

My implementation is quite naïve in that I’m just chunking the raw CSV data into a database and then hoping the token index will provide good lookup results. That has been my experience with other queries (lookup times of 0.02 seconds or better), so the 0.3 second time looks anomalous and makes me suspect an error on my end.

This is in the context of a generic “enable processing of any CSV data” feature, rather than a dedicated “report on page views data” feature, where I would construct a more efficient index (e.g., node IDs to page view data or something similar).

Here are the settings for the analytics database, which holds the CSV XML data:

NAME	_analytics
SIZE	257 MB
NODES	9793157
DOCUMENTS	11
BINARIES	0
VALUES	0
TIMESTAMP	2024-07-14T20:49:34.624Z
UPTODATE	✓
RESOURCEPROPERTIES
INPUTPATH
INPUTSIZE	0 b
INPUTDATE	2024-04-17T21:37:04.516Z
INDEXES
TEXTINDEX	✓
ATTRINDEX	✓
TOKENINDEX	✓
FTINDEX	–
TEXTINCLUDE
ATTRINCLUDE
TOKENINCLUDE
FTINCLUDE
LANGUAGE	English
STEMMING	–
CASESENS	–
DIACRITICS	–
STOPWORDS
UPDINDEX	✓
AUTOOPTIMIZE	–
MAXCATS	100
MAXLEN	255
SPLITSIZE	0

Thanks,

Eliot

_____________________________________________

Eliot Kimber

Sr Staff Content Engineer

Digital Content & Design

O: 512 554 9368

M: 512 554 9368

servicenow.com


From: Christian Grün <christian.gruen@gmail.com>
Date: Tuesday, July 16, 2024 at 9:32 AM
To: Eliot Kimber <eliot.kimber@servicenow.com>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Query optimization: What can I check or measure?


Hi Eliot,

It’s difficult to give a general response on that without having a complete look at the architecture, but I’ll try:

I’m measuring a consistent 0.3 seconds for this query:

How much time is spent if you omit the parent step?

db:token($analyticsmgmt:analyticsDatabase, $docPath, 'topicpath')

Next, how many results do you get for a single request? Is it always a single result, or can it be a vast number? How are the values distributed (index:tokens may help to assess this)?

You can attach "=> prof:time()" to an expression to do some isolated performance measurements.
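As a rough illustration of the two checks suggested above, a minimal sketch might look like the following. It assumes the database name _analytics from the settings listing; the $docPath value is a made-up placeholder, not a real path from the data. Note that index:tokens lists tokens from all attributes in the token index, not only topicpath, so the count is only an approximation of how many topicpath hits to expect.

```xquery
(: Sketch only: '_analytics' is taken from the settings listing;
   the $docPath value is a hypothetical placeholder. :)
let $db := '_analytics'
let $docPath := 'some/topic/path.dita'
return (
  (: Time the bare token-index lookup, without the parent step.
     prof:time reports the elapsed time and passes the result through. :)
  db:token($db, $docPath, 'topicpath') => count() => prof:time(),

  (: Number of occurrences recorded in the token index for this token,
     across all attribute names. :)
  index:tokens($db)[. = $docPath]/@count ! xs:integer(.)
)
```

If the first number is small (three or four) and the measured time is still around 0.3 seconds, the lookup itself is the place to dig further; if it is large, the distribution of values is the more likely cause.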

In principle, it makes no difference if the data is stored in one huge document or in millions of documents.

Best,

Christian
