I’m measuring the specific db:token() lookup in order to isolate it from the effects of other processing.

These are page view records per document, covering several different published versions of each document, so for a given path you would expect at most three or four results rather than thousands.

 

My implementation is quite naïve: I’m just chunking the raw CSV data into a database and hoping the token index will provide good lookup times. That has been my experience with other queries (lookup times of 0.02 seconds or better), so the 0.3-second time looks anomalous and makes me suspect an error on my end.

 

This is in the context of a generic “enable processing of any CSV data” feature rather than a dedicated “report on page view data” feature, for which I would construct a more efficient index (e.g., node IDs to page view data or something similar).

 

Here are the settings for the analytics database, which holds the CSV XML data:


NAME                _analytics
SIZE                257 MB
NODES               9793157
DOCUMENTS           11
BINARIES            0
VALUES              0
TIMESTAMP           2024-07-14T20:49:34.624Z
UPTODATE
RESOURCEPROPERTIES
INPUTPATH
INPUTSIZE           0 b
INPUTDATE           2024-04-17T21:37:04.516Z
INDEXES
TEXTINDEX
ATTRINDEX
TOKENINDEX
FTINDEX
TEXTINCLUDE
ATTRINCLUDE
TOKENINCLUDE
FTINCLUDE
LANGUAGE            English
STEMMING
CASESENS
DIACRITICS
STOPWORDS
UPDINDEX
AUTOOPTIMIZE
MAXCATS             100
MAXLEN              255
SPLITSIZE           0

 

 

Thanks,

 

Eliot

_____________________________________________

Eliot Kimber

Sr Staff Content Engineer

Digital Content & Design


 

From: Christian Grün <christian.gruen@gmail.com>
Date: Tuesday, July 16, 2024 at 9:32 AM
To: Eliot Kimber <eliot.kimber@servicenow.com>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Query optimization: What can I check or measure?


 


Hi Eliot,

 

It’s difficult to give a general response on that without having a complete look at the architecture, but I’ll try:

 

I’m measuring a consistent 0.3 seconds for this query:

 

How much time is spent if you omit the parent step?

 

  db:token($analyticsmgmt:analyticsDatabase, $docPath, 'topicpath')

 

Next, how many results do you get for a single request? Is it always a single result, or can it be a vast number? How are the values distributed (index:tokens may help to assess this)?
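
As a rough sketch of the distribution check: index:tokens lists the distinct tokens in the token index together with their occurrence counts, so sorting by count shows whether a few tokens account for a very large number of hits. The database name '_analytics' is taken from the settings quoted below; the sketch assumes the count is exposed as a count attribute on each returned element, and note that the listing covers tokens from all attributes, not only topicpath.

```xquery
(: Assumption: database name as shown in the quoted settings. :)
declare variable $db as xs:string := '_analytics';

(: One element per distinct token; @count is assumed to hold the
   number of occurrences of that token in the index. :)
let $entries := index:tokens($db)
return (
  (: Number of distinct tokens in the index. :)
  count($entries),
  (: The ten most frequent tokens, highest count first. :)
  (
    for $entry in $entries
    order by xs:integer($entry/@count) descending
    return $entry
  )[position() <= 10]
)
```

If the top counts are in the single digits, the index itself is unlikely to be the bottleneck for a single lookup.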

 

You can attach "=> prof:time()" to an expression to do some isolated performance measurements.
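
A minimal sketch of such an isolated measurement, comparing the bare index lookup with the same lookup followed by a parent step. The database name comes from the settings quoted below; the $docPath value is a hypothetical placeholder, and the /.. step is an assumption about what the original query's parent step looked like.

```xquery
(: Assumption: database name as shown in the quoted settings. :)
declare variable $db as xs:string := '_analytics';
(: Hypothetical token; substitute a real topicpath value. :)
declare variable $docPath as xs:string := 'path/to/topic.dita';

(: Time the token index lookup on its own. :)
let $attrs :=
  db:token($db, $docPath, 'topicpath')
  => prof:time('token lookup: ')
(: Time the same lookup followed by the parent step. :)
let $parents :=
  (db:token($db, $docPath, 'topicpath')/..)
  => prof:time('lookup + parent step: ')
(: Return the result sizes so both expressions are used. :)
return (count($attrs), count($parents))
```

The timings appear in the query info output with the given labels; running the query several times helps to separate cold-cache effects from the steady-state cost.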

 

In principle, it makes no difference if the data is stored in one huge document or in millions of documents.

 

Best,

Christian