BaseX-Talk April 2015

basex-talk@mailman.uni-konstanz.de

32 participants
60 discussions

multi-language full-text indexing
by Goetz Heller 23 Apr '15


Here's another addendum: Even if multi-language full-text indexing is not going to be implemented in the near future, it would still be useful to be able to restrict full-text indexing to parts of a document, e.g.

    CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
      (path_a)/PART_A,
      (path_b)/PART_B,…
    )

Kind regards,
Goetz

-----Original Message-----
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Wednesday, 22 April 2015 11:03
To: Goetz Heller
Cc: BaseX
Subject: Re: [basex-talk] multi-language full-text indexing

> It is desirable to have documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: If you don't specify the language in the query, we'd need to dynamically decide what stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend users to create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery. I remember there have been various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.ht…
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.ht…

> > CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
> >   (path_a)/LOCALIZED_PART_A[@LANG=$lang],
> >   (path_b)/LOCALIZED_PART_B[@LG=$lang],…
> > ) FOR LANGUAGE $lang IN (
> >   BG,
> >   DN,
> >   DE WITH STOPWORDS filepath_de WITH STEM = YES,
> >   EN WITH STOPWORDS filepath_en,
> >   FR, …
> > ) [USING language_code_map]
>
> and then to write full-text retrieval queries with a clause such as
> 'FOR LANGUAGE BG', for example. The index parts would be much smaller
> and full-text retrieval therefore much faster. The language codes
> would be mapped somehow to standard values recognized by BaseX in the
> language_code_map file.
>
> Are there any efforts towards such a feature?


Creation of Full-Text-Index failed
by Goetz Heller 22 Apr '15


For the task at hand I need to create a database on a daily basis from file packages I receive. The language used here is German; however, the files contain lots of international characters as well. Usually this does no harm, and I don't know whether it is the real cause of the failure in this case.

The database was actually created, but an error message occurred which was not very specific: "file xxx could not be parsed". File "xxx" was the last file of the package, and it was accessible for XQuery search. However, no full-text index was created, as it was for the other packages. Trying to create the index directly resulted in a different message: "Improper use? . Stack Trace: java.lang.ArrayIndexOutOfBoundsException".

The package can be downloaded from http://www.hellerim.de/downloads/BaseX/20150203_023.7z. This does not look like a problem with the data but rather like a bug in BaseX. If I'm wrong, however, I would prefer to get a message which points me to the problem so I can try to solve it.

Kind regards,
Goetz


Distributing queries to several on several processors
by Goetz Heller 22 Apr '15


So far I have not found any information on how BaseX can be told how to use computing resources. The use case here is as follows: I get XML files each day, usually between 50 and 100 MB in total. These are organized in one database per day. Since most queries run on a daily basis, this works perfectly fine.

However, there are situations when I need to run a query over a larger time span, say three or six months. (Note that I'm speaking of read-only queries here, not of transactions.) Of course I can do this in a loop (for $db in $db-list), but since the data in each database is completely independent of that in the other databases, it would make perfect sense to parallelize the query. Is there a way to achieve this using XQuery?

I'm aware of the possibility of splitting the sequence into several ones and running them in different threads on different connections using Java, for instance. But even then I still don't know what the server does (my queries run in a client-server configuration): will it occupy just one processor, or will it distribute the workload?
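The client-side approach mentioned above (splitting the list of databases and running the queries in separate threads) can be sketched as follows. This is only an illustration in Python, not BaseX code: `run_query` is a hypothetical stand-in for a read-only query sent over its own connection, and whether the server then spreads the work across processors is exactly the open question of this post.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(db_name):
    # Hypothetical stand-in for a read-only query against one daily database.
    # A real client would open its own connection to the server here.
    return [f"{db_name}:hit"]


def query_all(db_names, max_workers=4):
    # The databases are independent, so each query can run in its own thread.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of db_names, not completion order.
        per_db = list(pool.map(run_query, db_names))
    # Flatten the per-database result lists into a single sequence.
    return [hit for hits in per_db for hit in hits]


if __name__ == "__main__":
    print(query_all(["20150201", "20150202", "20150203"]))
```

Keeping the results in input order makes the parallel version a drop-in replacement for the sequential loop over `$db-list`.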


multi-language full-text indexing
by Goetz Heller 22 Apr '15


I'm working with documents destined to be consumed anywhere in the European Community. Many of them contain the same tags multiple times, each with a different language attribute. It therefore does not make sense to create a single full-text index over the whole of these documents. It is desirable to have documents indexed by locale-specific parts, e.g.

    CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
      (path_a)/LOCALIZED_PART_A[@LANG=$lang],
      (path_b)/LOCALIZED_PART_B[@LG=$lang],…
    ) FOR LANGUAGE $lang IN (
      BG,
      DN,
      DE WITH STOPWORDS filepath_de WITH STEM = YES,
      EN WITH STOPWORDS filepath_en,
      FR, …
    ) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR LANGUAGE BG', for example. The index parts would be much smaller and full-text retrieval therefore much faster. The language codes would be mapped somehow to standard values recognized by BaseX in the language_code_map file.

Are there any efforts towards such a feature?
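To illustrate the idea of one index part per language, here is a minimal Python sketch. It is not BaseX functionality: the sample document, the function names and the whitespace tokenizer are assumptions made for the example, and stemming and stop words are left out. The lookup insists on a language, so that each search touches only one (smaller) index part.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: the same tag repeated with different language attributes.
SAMPLE = """<doc>
  <title LANG="DE">Haus am See</title>
  <title LANG="EN">House by the lake</title>
  <title LANG="FR">Maison au bord du lac</title>
</doc>"""


def build_language_indexes(xml_text, lang_attr="LANG"):
    # One inverted index per language: language -> token -> element positions.
    indexes = defaultdict(lambda: defaultdict(set))
    root = ET.fromstring(xml_text)
    for pos, elem in enumerate(root.iter()):
        lang = elem.get(lang_attr)
        if lang is None or not elem.text:
            continue  # only index localized elements that carry text
        for token in elem.text.lower().split():
            indexes[lang][token].add(pos)
    return indexes


def search(indexes, token, lang=None):
    # Searching a multi-language index without a language is treated as an error.
    if lang is None:
        raise ValueError("a language must be specified")
    return set(indexes.get(lang, {}).get(token.lower(), set()))


if __name__ == "__main__":
    idx = build_language_indexes(SAMPLE)
    print(search(idx, "haus", lang="DE"))
```

The positions returned are document-order positions of the matching elements; a real index would store node references instead.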


Re: [basex-talk] Distributing queries to several on several processors
by Christian Grün 22 Apr '15


Hi Götz (cc @ basex-talk),

> OK, I think I understand. However, I think there should be some
> possibilities to allow the user to give hints. In my opinion, FOR-loops
> would be first-class candidates to use parallel streams, in particular
> in the use case I described in my previous posting:
>
>   FOR $var IN (collection)
>   PARALLEL RETURN (expression-list)

Makes sense, in general. XQuery pragmas could be a solution:

    (# basex: parallel #) { ... }

Alternatively, higher-order functions could provide something like hof:parallel-map(...).

However, this has many effects on the architecture of BaseX in terms of performance, because we'd need to create new contexts for each parallelized query, which takes additional time. See the following query as an example:

    $x[. = "123"]

The dot refers to the "current context item". If we parallelize a query, we'd have multiple current context items. The same multiplication would apply to the stack frame and other runtime variables, and the time lost for duplicating these instances is in most cases more expensive than doing the work in a single thread. At least that's our experience so far.

Once again, we are happy to see people jump into our code and show us that it can be done better.

Christian


multi-language full-text indexing
by Goetz Heller 22 Apr '15


The case you described should be made a non-issue: If a multi-language full-text index was created, then it was surely intended to execute searches within the confines of a specific language. Hence, if no language is specified in the query, a runtime error should be thrown.

Kind regards,
Goetz

-----Original Message-----
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Wednesday, 22 April 2015 11:03
To: Goetz Heller
Cc: BaseX
Subject: Re: [basex-talk] multi-language full-text indexing

> It is desirable to have documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite some effort to realize it. There are also various conceptual issues related to XQuery Full Text: If you don't specify the language in the query, we'd need to dynamically decide what stemmers to use for the query strings, depending on the nodes that are currently targeted. This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend users to create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery. I remember there have been various examples for this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.ht…
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.ht…

> > CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
> >   (path_a)/LOCALIZED_PART_A[@LANG=$lang],
> >   (path_b)/LOCALIZED_PART_B[@LG=$lang],…
> > ) FOR LANGUAGE $lang IN (
> >   BG,
> >   DN,
> >   DE WITH STOPWORDS filepath_de WITH STEM = YES,
> >   EN WITH STOPWORDS filepath_en,
> >   FR, …
> > ) [USING language_code_map]
>
> and then to write full-text retrieval queries with a clause such as
> 'FOR LANGUAGE BG', for example. The index parts would be much smaller
> and full-text retrieval therefore much faster. The language codes
> would be mapped somehow to standard values recognized by BaseX in the
> language_code_map file.
>
> Are there any efforts towards such a feature?


RESTXQ accept/produces issue
by Marc van Grootel 22 Apr '15


Hi,

I spent a couple of hours pulling my "hair" before I realized what was going on here.

Question: what happens when I call a RESTXQ function which has a rest:produces('application/xml') annotation but the request does not have an Accept header?

This is what the HTTP 1.1 spec [1] says about that:

"If no Accept header field is present, then it is assumed that the client accepts all media types. If an Accept header field is present, and if the server cannot send a response which is acceptable according to the combined Accept field value, then the server SHOULD send a 406 (not acceptable) response."

What actually happens is that you get a 404, and this is caused by the rest:produces annotation.

In a REST call you do not always set, or have the option to set, an appropriate Accept header (e.g. with HTTP client libraries or when doing a doc('http://.....') call from XSLT).

I believe that when no Accept header is present, the server should assume that any media type is OK. Additionally, it would be nice for REST clients if a 406 were returned instead of a 404 when the path matches but content negotiation fails. A 404 says the resource does not exist, whereas a 406 expresses that the issue is with the media type but the resource may exist.

Quite possibly the text in the RESTXQ spec [2] has to be modified as well in that case, because it currently reads (consistent with current behaviour):

"If the %rest:produces annotation is specified, a function will only be invoked if the HTTP Accept header of the request matches one of the given types."

Would it be possible to get this changed? Or is it maybe better to take this up in another forum?

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
[2] http://exquery.github.io/exquery/exquery-restxq-specification/restxq-1.0-sp…

--Marc
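The behaviour proposed here can be sketched in a few lines of Python. This is only an illustration of the status-code logic, not how BaseX or RESTXQ is implemented: the `routes` table and both function names are made up for the example, and quality values (`q=...`) in the Accept header are ignored for brevity.

```python
def accepts(accept_header, produced):
    # No Accept header: the client is assumed to accept all media types.
    if accept_header is None:
        return True
    p_type, _, p_sub = produced.partition("/")
    for item in accept_header.split(","):
        # Drop parameters such as ";q=0.8" (simplification: q values are ignored).
        media_range = item.split(";")[0].strip()
        if not media_range:
            continue
        a_type, _, a_sub = media_range.partition("/")
        if a_type in ("*", p_type) and a_sub in ("*", p_sub):
            return True
    return False


def status_for(path, accept_header, routes):
    # routes maps a path to the media type its function produces (hypothetical).
    produced = routes.get(path)
    if produced is None:
        return 404  # no function is bound to this path
    if not accepts(accept_header, produced):
        return 406  # path matched, but content negotiation failed
    return 200


if __name__ == "__main__":
    routes = {"/items": "application/xml"}
    for accept in (None, "application/json", "text/html, application/*;q=0.8"):
        print(accept, status_for("/items", accept, routes))
```

Separating "path not found" from "type not acceptable" is what lets a client tell a missing resource apart from a negotiation failure.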


Creation of Full-Text-Index failed
by Goetz Heller 21 Apr '15


Addendum: This is an issue with the German language only. Indexing with all other languages works as expected, provided there is enough RAM available to the JVM (the error messages issued are sometimes somewhat strange: they report negative array indexes in case of insufficient memory). Meanwhile I have found two other packages exhibiting the same problem. The Java batch runs on a Windows 8 box with BaseX 8.1.1 and Java 1.8.0_40_b25 installed, in client-server mode.


Optimizing Element Access By Attribute Value Matching
by Eliot Kimber 21 Apr '15


DITA defines the notion of a layered hierarchy of element types, where every DITA-defined element is either a base type or a "specialized" type derived from some base type. The type hierarchy of each element is specified by a @class attribute that lists the ancestry and leaf type of the element. For example, the element type "concept" is a specialization of the base type "topic" and so has a @class value of "- topic/topic concept/concept ". Each blank-delimited term is a module name/element name pair.

Processing in DITA is "specialization aware" if elements are selected in terms of a @class token rather than the concrete element type. For example, you might apply processing to topics of any type by matching on "*[contains(@class, ' topic/topic ')]", which will match all DITA topics, regardless of their specialized type.

The challenge this presents in a database context is optimizing lookups based on these @class values. For large repositories, an XQuery like "//*[contains(@class, ' topic/topic ')]" is going to be quite slow, as it requires a string comparison of every @class value. Even if there is an attribute value index, it will still be slow.

The obvious solution would be to index by @class token, e.g., an index where the keys are "topic/topic", "topic/p", etc.

Is there a way to construct such an index in BaseX? Is there a better way to address this type of string-match-based lookup?

Thanks,
Eliot

--
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com
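The token index described above can be illustrated with a short Python sketch. It is not a BaseX feature or API: the sample document and `build_class_index` are assumptions for the example. The point is that each @class value is split once at indexing time, so a lookup by token becomes a dictionary access instead of a substring test on every element.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample with DITA-style @class values.
SAMPLE = """<dita>
  <concept class="- topic/topic concept/concept ">
    <p class="- topic/p ">text</p>
  </concept>
  <topic class="- topic/topic ">
    <p class="- topic/p ">more</p>
  </topic>
</dita>"""


def build_class_index(root):
    # Map each module/element token to the elements whose @class contains it.
    index = defaultdict(list)
    for elem in root.iter():
        for token in elem.get("class", "").split():
            if "/" in token:  # skip the leading "-" or "+" marker
                index[token].append(elem)
    return index


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    index = build_class_index(root)
    # All topics, regardless of their specialized type:
    print([elem.tag for elem in index["topic/topic"]])
```

In a database setting, the same mapping could presumably be kept in an additional database that serves as the index, with node identifiers stored in place of the in-memory elements.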


Simple xQuery functions do not work as expected
by Goetz Heller 21 Apr '15


Hi Christian,

nice tip, I did not know this. I'm really a newbie with BaseX, and I'm just trying to get my work done. So far it has been a very nice experience, and BaseX supports many of the requirements I have. I'll give it a try and let you know!

Kind regards

