Dear BaseX Team,
We are looking at BaseX to serve as the fulltext query engine for a customer's content store.
The customer is a publishing company. They have organized their content in a directory structure like
/works/
  work1/
    src/
      chap1.xml
      chap2.xml
      ...
  work2/
    src/
      chap1.xml
      chap2.xml
      ...
  ...
If I import all content into a database (which seems to map to, or even be synonymous with, a collection in BaseX lingo), there is no way to restrict queries to one specific work. I thought that searching below //root()[matches(document-uri(.), '/work2')] might help, only to discover that a) all documents in this collection are flat, i.e., 'work1' or 'work2' is not part of any root node's document-uri(), and b) this kind of query seems to be very inefficient.
In qizx, I create what they call a library group, import all documents below a certain local directory into that library group, and then query it with, e.g.,

  for $d in collection('/works/work2')//*[. ftcontains 'foo']
  return <result base-uri="{document-uri(root($d))}">{$d}</result>
Not only am I unable to restrict a query to a certain subdirectory, I also cannot create links to search results in BaseX. The publisher's content resides in an svn repository. Updates of the XML search engine's data pool are triggered by an svn commit hook. The user should get a URL like http://svn.acme.com/viewsvn/?do=view&project=works&path=/work2/src/c... as a query result, in order to be able to browse the content. (Until versioning becomes part of a client/server, XQFT-enabled XML database that can also store arbitrary non-XML files, they will probably continue to use svn as their content repository.) In BaseX, we cannot generate links to work2/src/chap2.xml because chap2.xml's document URI lacks the path parts.
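To make the link requirement concrete, here is a minimal Python sketch of how such a URL could be assembled once the repository path of a hit is known. The host and the query parameters (do, project, path) are taken from the example URL above; the function name viewsvn_link and the sample path are assumptions for illustration only.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in this message.
VIEWSVN_BASE = "http://svn.acme.com/viewsvn/"


def viewsvn_link(repo_path: str, project: str = "works") -> str:
    """Build a viewsvn browse link for a path inside the svn project."""
    # safe="/" keeps the slashes of the path readable instead of %2F.
    query = urlencode(
        {"do": "view", "project": project, "path": repo_path},
        safe="/",
    )
    return f"{VIEWSVN_BASE}?{query}"


if __name__ == "__main__":
    print(viewsvn_link("/work2/src/chap2.xml"))
```

This only works if the query result carries the full repository path, which is exactly the piece of information that is missing from the document URI at the moment.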
I was thinking about putting each work in a collection of its own. But then I don't see how I could possibly perform a full-text search of the whole content (all collections). And if each project contains a substructure such as

work2/
  trunk/
    src/
      chap1.xml
      chap2.xml
      ...
  tags/
    ed04/
      src/
        chap1.xml
        chap2.xml

then the number of collections to create or query grows even further.
Can you recommend a BaseX way of dealing with these requirements? Maybe inserting path information as a processing instruction into each XML file (this could be performed by a commit hook, too)? Inserting the paths as content or as an attribute might not be possible because some of the schemas, ranging from custom DTDs to public XHTML or DocBook schemas, may not be altered. Or rather: patching the files to accommodate an XML database seems acceptable, while altering the schemas or creating custom schemas derived from public ones seems unacceptable. Searching PIs may be inefficient, however. Any other ideas?
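To illustrate the processing-instruction idea, here is a minimal Python sketch of what the commit hook could do to each file before handing it to the database. The PI name source-path and the function add_path_pi are hypothetical, not a BaseX convention. The sketch appends the PI after the root element, where XML allows PIs, so the XML declaration, the DOCTYPE, and the schema-governed element content stay untouched.

```python
import re

PI_TARGET = "source-path"  # hypothetical PI name, chosen for this sketch


def add_path_pi(xml_text: str, repo_path: str, target: str = PI_TARGET) -> str:
    """Return xml_text with a trailing <?target repo_path?> processing instruction."""
    if "?>" in repo_path:
        raise ValueError("PI data must not contain '?>'")
    # Drop a PI left over from an earlier run so repeated calls are idempotent.
    stale = re.compile(r"\s*<\?" + re.escape(target) + r"\s.*?\?>", re.DOTALL)
    cleaned = stale.sub("", xml_text).rstrip()
    # PIs may follow the root element, so appending does not touch the
    # prolog or anything a DTD or schema would validate.
    return f"{cleaned}\n<?{target} {repo_path}?>\n"


if __name__ == "__main__":
    doc = '<?xml version="1.0"?>\n<chapter><title>foo</title></chapter>\n'
    print(add_path_pi(doc, "/work2/src/chap2.xml"))
```

On the query side, the PI would then be a child of the document node and reachable with an XPath step such as /processing-instruction('source-path'). Whether that lookup can be made fast is the open question.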
Best regards,
Gerrit
Gerrit,
thanks for your feedback. As you correctly observed, all collections in BaseX are flat, i.e., don't mirror the original document structure. The preservation of file paths is on our todo list, though; you'll get a notification as soon as this issue is resolved.
Christian
On Mon, Apr 19, 2010 at 12:26 AM, Imsieke, Gerrit, le-tex gerrit.imsieke@le-tex.de wrote:
___________________________
Christian Gruen Universitaet Konstanz Department of Computer & Information Science D-78457 Konstanz, Germany Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577 http://www.inf.uni-konstanz.de/~gruen
basex-talk@mailman.uni-konstanz.de