Hello,

I have a database with several thousand XML documents, and tens of thousands more that I'd like to add. Each document contains one book: the bibliographic metadata (title, author, and so on, each in its own element) plus the complete OCR text of every page of the book. Each page of text is in a <page> element with a single text node holding all of the words on that page, so the text nodes are large blocks of text.
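
To give a concrete picture, each document looks roughly like this (the root and metadata element names below are just placeholders; the <page> elements are as described):

<book>
  <title>...</title>
  <author>...</author>
  <page>all of the OCR text of page 1, as one long text node</page>
  <page>all of the OCR text of page 2, as one long text node</page>
  ...
</book>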

I’ve added a full-text index and optimized it, but full-text searching is still very slow. The query below consistently takes about 20 seconds to run, even though there are only about 7,400 documents. That’s obviously far too slow for a web application, where users expect a quick response.

My first question is whether the query is actually using the full-text index. Is there a way for me to determine that?

I’m also wondering whether my query is too crude or is missing something. I don’t need the text nodes that contain the search words; I only need to know which documents contain them.

let $keywords := "apple"
for $doc in collection("theocom")
let score $score := $doc contains text {$keywords}
order by $score descending
where $score > 0
return concat($score, " ", base-uri($doc))

As you can see, I’m searching the entire text of each book. Is there a way to rewrite such a query so that it performs better?
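
For example, I noticed ft:search() in the Full-Text Module and wondered whether something along these lines would be faster, since it appears to pull matching text nodes straight from the index. This is only a sketch based on my reading of the documentation; I'm not sure I'm using the function correctly, and I'm using the number of matching text nodes per document as a rough stand-in for the score:

let $keywords := "apple"
for $hit in ft:search("theocom", $keywords)
group by $uri := base-uri(root($hit))
order by count($hit) descending
return concat(count($hit), " ", $uri)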

Also, I’m wondering whether the documents themselves need to be restructured into smaller blocks of text. For example, if the OCR text were stored in <line> elements, each containing a single line of text as printed in the original physical book, would full-text searching be noticeably faster because each text node is much smaller?
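
In other words, instead of one long text node per <page>, each page would look something like:

<page>
  <line>first line of OCR text on the page</line>
  <line>second line of OCR text on the page</line>
  ...
</page>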

Thanks,

Greg