I am loading page view data exported from Adobe as CSV as XML with a structure like:
<csv timestamp="2024-01-24"> <record language="en-US" family="vancouver" bundlename="bundle-platcap-platform-capabilities" topicpath="administer/general/concept/capabilities-bundle-landingpage.dita" notviewed="viewed" viewcount="120587"> <entry name="URL">docs.servicenow.com/bundle/vancouver-servicenow-platform/page/administer/general/concept/capabilities-bundle-landingpage.html</entry> <entry name="Page Views">120587</entry> <entry name="Unique Visitors">50863</entry> <entry name="Bounce Rate">0.255523003</entry> <entry name="Content Velocity">3.289644475</entry> <entry name="Docs Engagement Rate">0.289549233</entry> </record> … 50K more records … </csv>
For this database I have a token index and the MAXLEN value is set to 255 to ensure that the @topicpath values will be tokenized (none should be longer than 255).
I’m measuring a consistent 0.3 seconds for this query:
let $docRecords as element()* := db:token($analyticsmgmt:analyticsDatabase, $docPath, 'topicpath')/..
Where $docPath is a value that will match tokens in the @topicpath attribute.
Based on other token-based optimizations I’ve done, I would expect faster results, at least 10-times faster.
My questions:
1. What can I look for that might be making this lookup slower than expected or, conversely, how can I prove that this is the fastest the query will return? 2. Is there a better way to manage this kind of tabular data with many 1000s of records within BaseX? For example, does it matter if the records are part of a single document or would it be better to have each record be a separate document?
Thanks,
Eliot _____________________________________________ Eliot Kimber Sr Staff Content Engineer Digital Content & Design O: 512 554 9368 M: 512 554 9368 servicenow.comhttps://www.servicenow.com LinkedInhttps://www.linkedin.com/company/servicenow | Twitterhttps://twitter.com/servicenow | YouTubehttps://www.youtube.com/user/servicenowinc | Facebookhttps://www.facebook.com/servicenow
Hi Eliot,
It’s difficult to give a general response on that without having a complete look at the architecture, but I’ll try:
I’m measuring a consistent 0.3 seconds for this query:
How much time is spent if you omit the parent step?
db:token($analyticsmgmt:analyticsDatabase, $docPath, 'topicpath')
Next, how much results do you get for a single request? Is it always a single result, or can it be a vast number? How are the values distributued (index:tokens may help to assess this)?
You can attach "=> prof:time()" to an expression to do some isolated performance measurements.
In principle, it makes no difference if the data is stored in one huge document or in millions of documents.
Best, Christian
I’m measuring the specific db:token() lookup in order isolate effects of other processing.
These are page view records per document covering several different published versions of each document, so for a given path you would expect at most three or four results, as opposed to 1000s of results.
My implementation is quite naïve in that I’m just chunking the raw CSV data into a database and then hoping the token index will provide good look up results, which has been my experience with other queries (look up times of 0.02 seconds or better), which makes the 0.3 second time a bit anomalous and makes me suspect an error on my end.
This is in the context of a generic “enable processing of any CSV data” feature, rather than a dedicated “report on page views data” feature, where I would construct a more efficient index (i.e., node IDs to page view data or something).
Here are the settings for the analytics database, which holds the CSV XML data:
NAME _analytics SIZE 257 MB NODES 9793157 DOCUMENTS 11 BINARIES 0 VALUES 0 TIMESTAMP 2024-07-14T20:49:34.624Z UPTODATE ✓ RESOURCEPROPERTIES INPUTPATH INPUTSIZE 0 b INPUTDATE 2024-04-17T21:37:04.516Z INDEXES TEXTINDEX ✓ ATTRINDEX ✓ TOKENINDEX ✓ FTINDEX – TEXTINCLUDE ATTRINCLUDE TOKENINCLUDE FTINCLUDE LANGUAGE English STEMMING – CASESENS – DIACRITICS – STOPWORDS UPDINDEX ✓ AUTOOPTIMIZE – MAXCATS 100 MAXLEN 255 SPLITSIZE 0
Thanks,
Eliot _____________________________________________ Eliot Kimber Sr Staff Content Engineer Digital Content & Design O: 512 554 9368 M: 512 554 9368 servicenow.comhttps://www.servicenow.com LinkedInhttps://www.linkedin.com/company/servicenow | Twitterhttps://twitter.com/servicenow | YouTubehttps://www.youtube.com/user/servicenowinc | Facebookhttps://www.facebook.com/servicenow
From: Christian Grün christian.gruen@gmail.com Date: Tuesday, July 16, 2024 at 9:32 AM To: Eliot Kimber eliot.kimber@servicenow.com Cc: basex-talk@mailman.uni-konstanz.de basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Query optimization: What can I check or measure? [External Email]
________________________________ Hi Eliot,
It’s difficult to give a general response on that without having a complete look at the architecture, but I’ll try:
I’m measuring a consistent 0.3 seconds for this query:
How much time is spent if you omit the parent step?
db:token($analyticsmgmt:analyticsDatabase, $docPath, 'topicpath')
Next, how much results do you get for a single request? Is it always a single result, or can it be a vast number? How are the values distributued (index:tokens may help to assess this)?
You can attach "=> prof:time()" to an expression to do some isolated performance measurements.
In principle, it makes no difference if the data is stored in one huge document or in millions of documents.
Best, Christian
I tried to create some code that we can use for joint testing. With the following snippet, you can create a database (sized around 400 MB) with 1 million value attributes and around 300.0000 distinct values:
db:create( 'test', <xml>{ for $i in 1 to 1000000 let $value := codepoints-to-string( random:seeded-integer($i mod 300000, 256, 26) ! (. + 97) ) return <item value='{ $value }'/> }</xml>, 'test.xml', map { 'maxlen': 256, 'tokenindex': true() } )
The following query chooses a random entry and returns all elements that contain the attribute with this value:
let $count := count(index:tokens('test')) let $pos := (abs(random:integer()) mod $count) + 1 return db:token('test', index:tokens('test')[$pos])
The second query takes around 3 ms in my tests. Do you get similar performance?
On my mac running 10.7, I get about 1.5ms per query. On my production sever, I get values like 0.08ms per query.
So there must be something going on with my CSV content or how I’m querying that is leading to the much slower lookup times…
Cheers,
E
_____________________________________________ Eliot Kimber Sr Staff Content Engineer Digital Content & Design O: 512 554 9368 M: 512 554 9368 servicenow.comhttps://www.servicenow.com LinkedInhttps://www.linkedin.com/company/servicenow | Twitterhttps://twitter.com/servicenow | YouTubehttps://www.youtube.com/user/servicenowinc | Facebookhttps://www.facebook.com/servicenow
From: Christian Grün christian.gruen@gmail.com Date: Tuesday, July 16, 2024 at 11:16 AM To: Eliot Kimber eliot.kimber@servicenow.com Cc: basex-talk@mailman.uni-konstanz.de basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Query optimization: What can I check or measure? [External Email]
________________________________ I tried to create some code that we can use for joint testing. With the following snippet, you can create a database (sized around 400 MB) with 1 million value attributes and around 300.0000 distinct values:
db:create( 'test', <xml>{ for $i in 1 to 1000000 let $value := codepoints-to-string( random:seeded-integer($i mod 300000, 256, 26) ! (. + 97) ) return <item value='{ $value }'/> }</xml>, 'test.xml', map { 'maxlen': 256, 'tokenindex': true() } )
The following query chooses a random entry and returns all elements that contain the attribute with this value:
let $count := count(index:tokens('test')) let $pos := (abs(random:integer()) mod $count) + 1 return db:token('test', index:tokens('test')[$pos])
The second query takes around 3 ms in my tests. Do you get similar performance?
basex-talk@mailman.uni-konstanz.de