Tamara,
Thanks for posting these results—that’s really useful information. Hardware optimization is usually the easiest solution if you can afford it…
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
LinkedIn | Twitter | YouTube | Facebook
From: BaseX-Talk <basex-talk-bounces@mailman.uni-konstanz.de> on behalf of Tamara Marnell <tmarnell@orbiscascade.org>
Date: Friday, February 25, 2022 at 12:20 PM
To: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: [basex-talk] Results of some experiments for improving full-text search speeds
Good morning--I hope everyone in Europe is safe and well.
This week I've been working on improving the speed of full-text searches on our website, and I wanted to share what I've found to be helpful and not so helpful.
Our project: We run a website of finding aids from 47 archival repositories in the western US. Archivists submit their finding aids as EADs through a CMS that indexes each document for full-text searching
in BaseX, with other custom indexes for facets like subject, material type, etc.
https://archiveswest.orbiscascade.org/
Our issue: We need to perform full-text searches that match documents with all of the terms in the query. The BaseX full-text index and ft:search() function work at the text node level. So we need to use the
mode "any word," then group by a unique identifier in the root and run an ft:contains() with the mode "all words," like this:
for $db_id in tokenize($d, '\|')
for $result in ft:search($db_id, $terms, map{'mode':'any word','fuzzy':$f})
let $ead := $result/ancestor::ead
let $ark := normalize-space($ead/eadheader/eadid/@identifier)
(: Other node values fetched here :)
group by $ark
where ft:contains(string-join($result, ' '), $terms, map{'mode':'all words','fuzzy':$f})
(: Custom ranking calculations based on other node values performed here :)
return <ark>{$ark}</ark>
Most users of the site are interested in their repositories only, and they use only a few words to find known collections, so most of the time results are delivered in less than 5 seconds. When users searched for
multiple terms across all 47 repositories, though, the search was taking 30-40 seconds.
What I DIDN'T find helpful: Modifying the XQuery
1. Performing a separate search for each term and mapping to fetch with db:open-id() later, instead of searching for all terms at once and grouping. This was slightly slower:
for $db_id in tokenize($d, '\|')
let $results := fold-left(
(: Get the ft:search() results as a map of node IDs and results :)
)
for $node_id in map:keys($results)
let $ead := db:open-id($db_id, $node_id)
(: Ranking etc. :)
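For reference, here is a simplified sketch of that fold-left (not our exact code, and assuming $terms has already been split into individual words); it reuses $db_id and $f from the snippets above:
let $results := fold-left(
  $terms,
  map {},
  function($acc, $term) {
    fold-left(
      ft:search($db_id, $term, map {'mode':'any word','fuzzy':$f}),
      $acc,
      (: key each hit by its database node ID so it can be fetched with db:open-id() later :)
      function($m, $hit) { map:put($m, db:node-id($hit), $hit) }
    )
  }
)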
2. For multi-term queries, creating a temporary database to search instead of the complete full-text indexes of all databases (a rough sketch follows the steps below).
a. Create a database like temp123456 with a full-text index.
b. For the first term, loop through all databases and add the documents to temp123456 with db:add($temp_id, db:path($ead)), then optimize.
c. For subsequent terms, map the results of a full-text search of temp123456 instead of looping through all 47 databases, then loop through all EADs in the temp database and remove the ones that don't match
with...
if (not(exists($temp_results(db:node-id($ead))))) then db:delete($temp_id, db:path($ead))
d. For the final term, return the usual ranked <ark /> values, then drop the database.
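A rough sketch of steps b and c (simplified, not the exact code; $term stands for the current query word and $temp_id for the temp123456 database name), with each step run as its own updating query so the db:add/db:optimize/db:delete updates are committed before the next term is searched:
(: Step b, first term: copy every matching EAD into the temp database,
   keyed under its original database path :)
for $db_id in tokenize($d, '\|')
for $ead in ft:search($db_id, $term, map {'mode':'any word','fuzzy':$f})/ancestor::ead
return db:add($temp_id, $ead, db:path($ead))
(: ...then db:optimize($temp_id) before searching the next term :)

(: Step c, each subsequent term: delete the EADs that have no hit for $term :)
let $temp_results := map:merge(
  for $ead in ft:search($temp_id, $term, map {'mode':'any word','fuzzy':$f})/ancestor::ead
  return map:entry(db:node-id($ead), $ead)
)
for $ead in db:open($temp_id)/ead
where not(exists($temp_results(db:node-id($ead))))
return db:delete($temp_id, db:path($ead))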
This didn't work because the optimization of the temporary databases took a long time--25-30 seconds for a term with 1K document results, according to the BaseX logs. It would only work in theory if entries from the
existing full-text indexes could be carried over for subsequent terms, instead of the temporary index being rebuilt from scratch.
What I DID find helpful: Changing the server
We use Amazon Web Services, so I experimented with putting the project on different instance types: General, Memory-Optimized, and Compute-Optimized.
The contenders:
General t2.medium: 2 vCPU, 4 GB memory
General t3.large: 2 vCPU, 8 GB memory
Compute-optimized c6g.xlarge: 4 vCPU, 8 GB memory
Memory-optimized r6g.large: 2 vCPU, 16 GB memory
At the time of testing we were on t2.medium, which is the most affordable and the smallest I thought would be sufficient.
For one-term queries, the difference in speed was not that much across the instance types. For longer queries, the number of virtual CPUs and memory made a big difference.
Query: native american tribes in oregon
t2.medium: 55 seconds
t3.large: 30 seconds
r6g.large: 22 seconds
c6g.xlarge: 21 seconds
Memory:
- Increasing from 4 GB on the t2.medium to 8 GB on the t3.large and c6g.xlarge nearly doubled the speed of my queries, even for queries of 2 words like "women's march" (20 seconds on t2.medium, 8-10 seconds
on the others).
- Increasing from 8 GB to 16 GB on a memory-optimized instance didn't affect the speed. The java process used only 10% of the available memory, versus 50-60% on other instance types, but queries were no
faster, even after increasing the maximum heap size in the basexhttp startup script. This could be because heap size isn't the right property to adjust.
CPU:
- Increasing from 2 vCPU to 4 vCPU on a compute-optimized instance improved speeds only for extreme queries of 4 words or more.
Query: papers
t3.large: 6 seconds
c6g.xlarge: 7 seconds
Query: family papers
t3.large: 18 seconds
c6g.xlarge: 18 seconds
Query: adams family papers
t3.large: 15 seconds
c6g.xlarge: 14 seconds
Query: adams family papers utah
t3.large: 25 seconds
c6g.xlarge: 21 seconds
For most queries our users are performing, which are short and restricted to one repository, it's a negligible difference that didn't justify the increase in cost for a compute-optimized instance ($40 per
month vs. $60 per month).
-Tamara
--
Tamara Marnell
Program Manager, Systems
Orbis Cascade Alliance (orbiscascade.org)
Pronouns: she/her/hers