Hi Daniel,
Thanks for your mail. Just a short while ago, we discussed how to extend indexing and query rewriting without completely overhauling our optimization engine, so it might be worth sharing this idea.
At the moment, as you may know, only text nodes and attribute values end up in the BaseX indexes. This allows us to rewrite as many paths as possible for index access: whenever a path expression points to a text node (or to an element that has only text nodes as children), we know that it can be rewritten for index access, no matter what the exact path to this text node looks like. This design decision turned out to be very powerful for exact searches and for full-text queries on arbitrary text nodes, but it is indeed too inflexible for mixed-content data.
Over time, we have learned that full flexibility can be helpful, but is not necessarily required in many TEI use cases: many users and developers have a rather small and fixed set of XML elements that is relevant for full-text processing.
A few years ago, we added features to restrict indexing to the text nodes of specific element names. We could extend this approach to full-text indexing:
1. Index the string value of specific elements, which will be specified by the user.
2. Rewrite for index access only those paths that do not address descendants of the indexed elements.
As an example, a user might want to query the "head" and "p" elements of a TEI document, and there will be no need to write queries for descendants of these elements.
<div>
  <head>No. 2, September 2006</head>
  <p>It was clearly popular, for it appears in Peter Stent’s advertisements of 1654 and 1662, and is still listed in his successor John Overton’s catalogue of 1673,<note>Alexander Globe, <title level="m">Peter Stent, London Printseller, c.</title> 1642-65 (Vancouver, 1985), p. 123 (no.*448).</note> yet only the unique impression in the British Museum's Department of Prints &amp; Drawings survives - testimony to the great rarity of such popular material.</p>
</div>
The following queries could then be answered via the index:
/div[head contains text '2006']
//p[. contains text 'popular']
Queries such as the following would no longer be rewritten for index access:
//p[text() contains text 'popular']
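To illustrate why the two predicates differ: the string value of "p" includes the text of all its descendants, whereas text() only addresses the text nodes that are direct children of "p". The following is a rough Python sketch of that distinction, not BaseX code; the helper names and the shortened sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the TEI fragment (assumption: abridged for brevity).
SAMPLE = (
    "<div><head>No. 2, September 2006</head>"
    "<p>It was clearly popular,"
    "<note>Alexander Globe, <title level=\"m\">Peter Stent</title>"
    " (Vancouver, 1985)</note>"
    " yet only the unique impression survives.</p></div>"
)

def string_value(elem):
    """Text of the element and all of its descendants, in document order."""
    return "".join(elem.itertext())

def child_text_nodes(elem):
    """Only the text nodes that are direct children of the element."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]

p = ET.fromstring(SAMPLE).find("p")

# Roughly corresponds to //p[. contains text 'Vancouver']
in_string_value = "Vancouver" in string_value(p)
# Roughly corresponds to //p[text() contains text 'Vancouver']
in_text_nodes = any("Vancouver" in t for t in child_text_nodes(p))

print(in_string_value, in_text_nodes)
```

An index built on the string value of "p" would report a hit for 'Vancouver', although no direct text child of "p" contains the word, so a text() predicate could not safely be answered from it.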
It might additionally be desirable to exclude specific elements from indexing. In the given example, users might want to exclude notes ("note" elements) from being included in the indexed string value.
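As a rough sketch of how such an index could be populated (plain Python, not the BaseX implementation; build_index, its include/exclude parameters and the simple tokenization are assumptions made for illustration):

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened version of the TEI fragment (assumption: abridged for brevity).
SAMPLE = (
    "<div><head>No. 2, September 2006</head>"
    "<p>It was clearly popular,"
    "<note>Alexander Globe, <title level=\"m\">Peter Stent</title>"
    " (Vancouver, 1985)</note>"
    " yet only the unique impression survives.</p></div>"
)

def string_value(elem, exclude=frozenset()):
    """String value of elem, skipping subtrees whose element name is excluded."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in exclude:
            parts.append(string_value(child, exclude))
        # Text following a child still belongs to elem, even if the child is skipped.
        parts.append(child.tail or "")
    return "".join(parts)

def build_index(root, include, exclude=frozenset()):
    """Map each lower-cased token to the (name, document position) of indexed elements."""
    index = defaultdict(set)
    for pos, elem in enumerate(root.iter()):
        if elem.tag in include:
            text = string_value(elem, exclude).lower()
            for token in re.findall(r"\w+", text):
                index[token].add((elem.tag, pos))
    return index

root = ET.fromstring(SAMPLE)
with_notes = build_index(root, include={"head", "p"})
without_notes = build_index(root, include={"head", "p"}, exclude={"note"})

print("vancouver" in with_notes, "vancouver" in without_notes)
```

With "note" excluded, terms that occur only inside the note no longer lead to the enclosing "p" element, while the text before and after the note is still indexed.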
There are numerous other features that could be included. The major challenge will be to define a simple core functionality that is flexible enough to be enhanced in future.
Daniel, what’s your opinion on this, and your first thoughts on what might be missing?
Thanks in advance,
Christian
On Thu, Sep 19, 2019 at 12:43 AM Schopper, Daniel Daniel.Schopper@oeaw.ac.at wrote:
Dear all,
chatting after a session of the ongoing TEI conference (https://graz-2019.tei-c.org) I was asked about plans to support fulltext indexes on mixed content nodes in BaseX. I did not know of any, so I wanted to pass the question on to this list: Is there a plan to implement this feature in the near (or not-so-near) future? If not, has any of the core devs estimated the effort to get this done? (Needless to say that it would be an awesome feature to have in BaseX ;-)
Thanks in advance & best
Daniel (just being curious)