Hi,
In "XQuery and XPath Full Text 1.0 Use Cases" [1] it says:
Querying across element boundaries is similar to an XQuery and XPath character string function converting the sub-tree under an element into a string by removing all markup.
However, I'm having trouble getting this to work in BaseX 7.2.1. For example, given this document:
<doc>
  <p id="1">Lorem ipsum dolor sit amet</p>
  <p id="2">Lorem ipsum dolor sit amet</p>
  <p id="3">Lorem ipsum dolor <i>sit</i> amet</p>
  <p id="4">Lorem ipsum <i>dolor sit</i> amet</p>
</doc>
How do I query, for example, for p elements that contain the string "sit amet"?
//p[. contains text 'sit amet']
returns only paragraphs 1 and 2. I've tried a number of variations, but I've been unable to come up with a query that returns all four p elements.
Are my queries incorrect or is this a bug in BaseX?
Thanks and best regards
Footnotes: [1] http://www.w3.org/TR/xpath-full-text-10-use-cases/
Dear Michael,
to get the requested result, you need to deactivate the chopping of whitespaces (via SET CHOP OFF, or Dialog → New… → Parsing → Chop Whitespaces).
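For example, on the command line (just a sketch; the database and file names are placeholders):

SET CHOP OFF
CREATE DB example doc.xml
XQUERY //p[. contains text 'sit amet']

With chopping disabled, all four p elements should then be returned.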
Hope this helps, Christian
-- Dr.-Ing. Michael Piotrowski, M.A. mxp@cl.uzh.ch Institute of Computational Linguistics, University of Zurich Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
- OUT NOW: Systems and Frameworks for Computational Morphology
- http://www.springeronline.com/978-3-642-23137-7
Hi Christian,
On 2012-05-08, Christian Grün christian.gruen@gmail.com wrote:
to get the requested result, you need to deactivate the chopping of whitespaces (via SET CHOP OFF, or Dialog → New… → Parsing → Chop Whitespaces).
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
Another question: The examples in the documentation and the actual behavior of BaseX suggest that ft:extract and ft:mark only work for queries on a single text node, i.e., queries such as
//p[text() contains text 'sit amet']
but not for
//p[. contains text 'sit amet']
Is this correct?
Thanks and greetings
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
Thanks; your edits are welcome.
Another question: The examples in the documentation and the actual behavior of BaseX suggest that ft:extract and ft:mark only work for queries on a single text node, [...]
Yes, that's true. In terms of the syntax, a dot (.) may work as well, if the addressed elements have no more descendant elements, as the following example shows:
x.xml: <x>A</x>
query: ft:mark(doc('x.xml')//x[. contains text 'a'])
Hope this helps, Christian
On 2012-05-08, Christian Grün christian.gruen@gmail.com wrote:
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
Thanks; your edits are welcome.
Ok, I'll add this then.
Another question: The examples in the documentation and the actual behavior of BaseX suggest that ft:extract and ft:mark only work for queries on a single text node, [...]
Yes, that's true. In terms of the syntax, a dot (.) may work as well, if the addressed elements have no more descendant elements, as the following example shows:
x.xml: <x>A</x>
query: ft:mark(doc('x.xml')//x[. contains text 'a'])
Thanks for the information, that's good to know. I think I'll file an enhancement request then: for text-oriented applications (e.g., TEI documents), it would be extremely useful if ft:mark worked with descendant elements; typically you have lots of mixed content, with elements containing rendering information or annotations, such as <hi>, <orig>, <corr>, <persName>, <placeName>, <handShift>, etc.; these elements don't interrupt the logical text flow.
Best regards
Thanks for the information, that's good to know. I think I'll file an enhancement request then: for text-oriented applications (e.g., TEI documents), it would be extremely useful if ft:mark worked with descendant elements; typically you have lots of mixed content, with elements containing rendering information or annotations, such as <hi>, <orig>, <corr>, <persName>, <placeName>, <handShift>, etc.; these elements don't interrupt the logical text flow.
While I concede this may be useful in numerous use cases (and may even seem obvious), it would take quite some time to get implemented, so... please don't expect too much magic for the moment. There will also be some conceptual issues that need to be resolved. As an example, which result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
If you don't need the inner elements, you may as well remove them from your document before applying ft:mark(). This is demonstrated by the following example:
copy $x := document { <p>This is a <i>simple</i> test.</p> }
modify
  for $t in $x//*[*][text()]
  return replace value of node $t with data($t)
return ft:mark($x//*[text() contains text 'test'])
If speed is a top priority, you may as well create a new database from all updated source files and build a full text index for the new database...
for $i in collection('...')
return
  copy $x := $i
  modify
    for $t in $x//*[*][text()]
    return replace value of node $t with data($t)
  return db:add(...)
Christian
Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
Thanks for the information, that's good to know. I think I'll file an enhancement request then: for text-oriented applications (e.g., TEI documents), it would be extremely useful if ft:mark worked with descendant elements; typically you have lots of mixed content, with elements containing rendering information or annotations, such as <hi>, <orig>, <corr>, <persName>, <placeName>, <handShift>, etc.; these elements don't interrupt the logical text flow.
While I concede this may be useful in numerous use cases (and may even seem obvious), it would take quite some time to get implemented, so... please don't expect too much magic for the moment. There will also be some conceptual issues that need to be resolved. As an example, which result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
I think it should be
<a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
Each token from the search string would be enclosed in a <mark>-element.
I once asked for continuous marking, i.e., only one opening and one closing mark tag for a search string consisting of several words. This would not be suitable for elements where I expect the hit to cover inner elements, because of the overlapping markup. But this might be solved with an option for ft:mark, to apply it either continuously or for each token found.
If you don't need the inner elements, you may as well remove them from your document before applying ft:mark().
This is a great idea if you would like to know whether the search elements are somewhere in your text.
However, if you would like to show the results to end users (= humanities people) or to annotate the document further, it's not a good idea to destroy the original structure. Or maybe one would have to come up with some tricky workaround to first replace the hierarchical node with a flat one for searching, then annotate something and somehow replace the original hierarchical one with the annotated one preserving the original hierarchy.
And for searching only, the scenario is a TEI-document representing an old printed book with highlighting (e.g., some things in italics), foreign-language words printed in a different font, person names already marked, etc. The TEI rendering is intended to mimic the original printed page. When implementing a full-text search, the end user expects to see the highlighted search tokens within the rendered page. Therefore the "easiest" way is to search in descendant nodes and use ft:mark to highlight the hits, without any need to change the TEI rendering. This would also allow the end user to not only see the node where the search string was found, but scroll up and down to inspect the context of the node.
But maybe for searching and displaying the results in the original document, one would have to develop a bigger application.
Best regards
Cerstin
On 2012-05-09, Cerstin Mahlow cerstin.mahlow@unibas.ch wrote:
While I concede this may be useful in numerous use cases (and may even seem obvious), it would take quite some time to get implemented, so... please don't expect too much magic for the moment. There will also be some conceptual issues that need to be resolved. As an example, which result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
I think it should be
<a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
Each token from the search string would be enclosed in a <mark>-element.
Exactly. While this probably wouldn't cover *all* possible scenarios, it would still cover most of the useful ones. In fact, it would be similar to http://www.raymondhill.net/blog/?p=272. It would also be applicable when ignoring elements in a search.
For complex applications it may help to get the start and end character positions of the matches (essentially standoff markup), and the application could then do the highlighting itself on the basis of this information.
[...]
If you don't need the inner elements, you may as well remove them from your document before applying ft:mark().
This is a great idea if you would like to know whether the search elements are somewhere in your text.
However, if you would like to show the results to end users (= humanities people) or to annotate the document further, it's not a good idea to destroy the original structure. Or maybe one would have to come up with some tricky workaround to first replace the hierarchical node with a flat one for searching, then annotate something and somehow replace the original hierarchical one with the annotated one preserving the original hierarchy.
And for searching only, the scenario is a TEI-document representing an old printed book with highlighting (e.g., some things in italics), foreign-language words printed in a different font, person names already marked, etc. The TEI rendering is intended to mimic the original printed page. When implementing a full-text search, the end user expects to see the highlighted search tokens within the rendered page. Therefore the "easiest" way is to search in descendant nodes and use ft:mark to highlight the hits, without any need to change the TEI rendering. This would also allow the end user to not only see the node where the search string was found, but scroll up and down to inspect the context of the node.
I fully agree, this is exactly what I need in my application: I don't want to retrieve snippets from the document, but I always have to display the full document with the hits highlighted.
What I'm going to do now is probably highlight the full paragraph which contains the node retrieved by the search, i.e., get the node ID, walk up the tree until I encounter a <p> and get its @xml:id, which I can then use in a CSS stylesheet. Or something like this. But this is clearly only an approximation.
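Something along these lines, I suppose (just a sketch; it assumes the matched node is bound to $hit and that the enclosing paragraphs carry @xml:id attributes):

declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare variable $hit external;  (: a node returned by the search :)

(: climb to the nearest enclosing <p> and return its xml:id for the stylesheet :)
string($hit/ancestor-or-self::tei:p[1]/@xml:id)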
Best regards
Thanks, Cerstin and Michael, for your suggestions.
Yes, all the use cases sound perfectly reasonable to me. When it comes to the implementation, I see quite a number of obstacles to making this happen. One of the reasons is that the full-text expression, as currently implemented, discards all elements before tokenizing the texts. This means that the following queries are basically the same:
<a>X <b>Y</b> Z</a>[. contains text 'X Y']
<a>X <b>Y</b> Z</a>[data() contains text 'X Y']
As Cerstin indicated, you'll probably have to parse all text nodes individually:
ft:mark(//*[text() contains text {'X', 'Y'}])
This simple approach, however, won't work out with phrases (multiple terms) that reach into descendant nodes.
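For example, applied to the <a>X <b>Y</b> Z</a> element from above, a query such as

//*[text() contains text 'X Y']

returns nothing, because none of the individual text nodes ("X ", "Y", " Z") contains the complete phrase.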
Christian
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
...thanks again for editing our Wiki -- always welcome! I've added a slightly adapted version of your last paragraph to indicate that full-text tokenization always works on the string values of the elements:
http://docs.basex.org/wiki/Full-Text#Mixed_Content
Feel free to further polish it, Christian
yet another fulltext question
I'm looking for a performant way to look up full-text information in a collection of image descriptions.
The db / document contains fewer than 10,000 items. The structure is simple:
<bilder>
  <bild>
    <InterneId/>
    <titel/>
    <kuenstler>
      <person/>
    </kuenstler>
    <datierung/>
    <dargestellte_person/>
    ...
    <widmung></widmung>
    <beschreibung></beschreibung>
  </bild>
  <bild/>
  <bild/>
  ...
</bilder>
When trying to get the title ('titel') of all nodes that contain the search string 'mag', I get the following:
doc('bilder')/bilder/bild[text() contains text 'mag']/titel
-> no result

doc('bilder')/bilder/bild[. contains text 'mag']/titel
-> 18 hits, query time about 950 ms

The following query produces the same result but is much faster:

ft:search(doc('bilder'), 'mag')/ancestor::*[local-name(.) = 'bild']/titel
-> 18 hits, query time less than 3 ms
Can I say that one should prefer ft:search and then climbing the ancestor axis over 'contains text'? The documentation says that 'contains text' is processed using the full-text index. If so, how can this performance difference be explained?
TIA
-- Manfred Knobloch - Medientechnik -
Leibniz-Institut für Wissensmedien IWM - Knowledge Media Research Center (KMRC) Schleichstraße 6 | 72076 Tübingen Fon: +49 7071 979-340 Internet: http://www.iwm-kmrc.de/m.knobloch
On 2012-05-09, Christian Grün christian.gruen@gmail.com wrote:
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
...thanks again for editing our Wiki -- always welcome! I've added a slightly adapted version of your last paragraph to indicate that full-text tokenization always works on the string values of the elements:
Thanks for the clarification. Thank you also for creating the ticket for ft:mark.
Frankly, I find it quite dangerous that CHOP is ON by default. Discarding whitespace in mixed content means losing information. I'd find it preferable if it were off by default; if you know your data and if you are aware of the effects of CHOP, *then* you could turn it on.
Best regards
On 2012-05-11 23:34, Michael Piotrowski wrote:
Frankly, I find it quite dangerous that CHOP is ON by default. Discarding whitespace in mixed content means losing information. I'd find it preferable if it were off by default; if you know your data and if you are aware of the effects of CHOP, *then* you could turn it on.
+1
Whitespace should always be preserved, unless the parser strips it because it knows that it’s ignorable whitespace (because it has been made aware of a schema?).
Looking at Saxon’s strip option, http://www.saxonica.com/documentation/javadoc/net/sf/saxon/s9api/WhitespaceS...
Saxon’s -strip:ignorable has a certain appeal, but when you consider how “ignorable” is specified:
The value IGNORABLE indicates that whitespace text nodes in element-only content are discarded.
fringe cases instantly come to mind where it will strip one whitespace too many:
<p><hi rend="bold">Hello</hi> <hi rend="bolditalic">World</hi></p>
“element-only content”: I’m not sure whether Saxon decides that it is in element-only content based upon information from the parser (“no text node allowed here”) or whether it draws its own conclusions (“no text node present here”).
Hmm.
On 2012-05-12 00:09, Imsieke, Gerrit, le-tex wrote:
I’m not sure whether Saxon decides that it is in element-only content based upon information from the parser (“no text node allowed here”) or whether it draws its own conclusions (“no text node present here”).
s/no text node/no non-WS text node/g;
Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
to get the requested result, you need to deactivate the chopping of whitespaces (via SET CHOP OFF, or Dialog → New… → Parsing → Chop Whitespaces).
I just ran into a related issue:
When creating my collection, I did not deactivate whitespace chopping. This results in:
<p rend="zenoPLm4n0" xml:id="tg5.2.22">Das ich aber Eur<hi rend="italic" xml:id="tg5.2.22.1">Excellenz,</hi>Hochwürden vnd Gnaden dises wintzige Werckel demütigst zuschreibe /hab ich ein sehr fügliche Ursach / weil ich nemlich dises kleine Tractätl habe zusammen getragen in der stattlichen Behausung Ihro Hochgräfflichen<hi rend="italic" xml:id="tg5.2.22.2">Excellenz</hi>Herrn Hanß Balthasar Graffen von<hi rend="italic" xml:id="tg5.2.22.3">Hojos</hi>der Zeit wertisten Landmarschall vnnd geheimen<hi rend="italic" xml:id="tg5.2.22.4">Deputir</hi>ten Rath ...</p>
instead of:
<p rend="zenoPLm4n0" xml:id="tg5.2.22">Das ich aber Eur <hi rend="italic" xml:id="tg5.2.22.1">Excellenz,</hi> Hochwürden vnd Gnaden dises wintzige Werckel demütigst zuschreibe /hab ich ein sehr fügliche Ursach / weil ich nemlich dises kleine Tractätl habe zusammen getragen in der stattlichen Behausung Ihro Hochgräfflichen <hi rend="italic" xml:id="tg5.2.22.2">Excellenz</hi> Herrn Hanß Balthasar Graffen von <hi rend="italic" xml:id="tg5.2.22.3">Hojos</hi> der Zeit wertisten Landmarschall vnnd geheimen <hi rend="italic" xml:id="tg5.2.22.4">Deputir</hi>ten Rath ...</p>
When the user wants to annotate something, he selects the respective text with the mouse, presses the "mark" button, and saves the annotation. Later he might want to review it, or whatever. So I store the node-id and the marked part to later be able to use ft:mark to show the user what he selected.
One thing that is a bit ugly is that the person "Herrn Hanß Balthasar Graffen von Hojos" is displayed as "Herrn Hanß Balthasar Graffen vonHojos", because of the missing whitespace. However, I can choose to save either
"Herrn Hanß Balthasar Graffen vonHojos" (the data of the selected part) or
"Herrn Hanß Balthasar Graffen von<hi rend="italic" xml:id="tg5.2.22.3">Hojos" (the exact part of the node)
But I have no idea how to make this selection visible in a later step.
The first question:
If I want to get whitespaces back, do I have to re-create the collection? Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The second question:
How would I display the selected text snippet to the user, when I store the node-id and the text (as mixed content)? ft:mark will not work, I think. I cannot store the whole node with the new markup, because in the end I will have different annotations for the same node (so I would store various versions of this node) and I don't see a clever way to collapse different versions of a node keeping all annotations to replace the original node.
Best regards
Cerstin
Hi Cerstin,
If I want to get whitespaces back, do I have to re-create the collection?
Yes; sorry for that. The database does not contain any information on chopped whitespaces, which is why you'll indeed have to reimport the documents.
Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The node ids will change if the documents include pure whitespace texts. The following example represents such a document; it contains three text nodes ("X", and two text nodes with a single newline character):
<hello>
<world>X</world>
</hello>
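To see what happens, you could list the node ids of such a database (a quick sketch; it assumes the document has been added to a database called 'test'):

for $node in db:open('test')//node()
return db:node-id($node) || ': ' ||
  (if ($node instance of text()) then 'text("' || $node || '")' else name($node))

With chopping disabled, the whitespace-only text nodes get ids of their own, so the ids of all following nodes are shifted compared to a chopped database.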
How would I display the selected text snippet to the user, when I store the node-id and the text (as mixed content)? ft:mark will not work, I think.
I'm not quite sure what you refer to here; could you attach a small example? Christian
PS@Michael and Gerrit: thanks for your opinion. One of the reasons for chopping whitespace by default is that whitespace texts in structured documents consume a lot of space in a database, although they will never need to be processed. However, I see that this solution may cause more confusion than it helps, which is why we'll think about switching the default behavior.
On 2012-05-13, Christian Grün christian.gruen@gmail.com wrote:
If I want to get whitespaces back, do I have to re-create the collection?
Yes; sorry for that. The database does not contain any information on chopped whitespaces, which is why you'll indeed have to reimport the documents.
Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The node ids will change if the documents include pure whitespace texts. The following example represents such a document; it contains three text nodes ("X", and two text nodes with a single newline character):
<hello>
<world>X</world>
</hello>
I'll be working with Cerstin on this issue, so here's a brief comment. Thanks for the example, that's what I feared ... I think we're lucky that we're only dealing with node IDs of elements, so we can annotate the elements with ID attributes, associate the node IDs with the XML IDs, and then translate them again to the node IDs of the "unchopped" database. If we were dealing with node IDs of text nodes, we'd be hosed ...
How would I display the selected text snippet to the user, when I store the node-id and the text (as mixed content)? ft:mark will not work, I think.
I'm not quite sure what you refer to here; could you attach a small example? Christian
I *think* what she means is: Since
ft:mark(//p[. contains text 'real'])
will not highlight anything if . contains mixed content with multiple text nodes, what is the best approach to highlight the results of a search, given a query and a matching node?
PS@Michael and Gerrit: thanks for your opinion. One of the reasons for chopping whitespace by default is that whitespace texts in structured documents consume a lot of space in a database, although they will never need to be processed.
Yes, I figured that it was intended for data-oriented documents.
However, I see that this solution may cause more confusion than it helps, which is why we'll think about switching the default behavior.
This would be very welcome! Your example above also nicely illustrates the problem. As the significance of whitespace in XML can only be determined when there's a schema, chopping whitespace by default means that, strictly speaking, documents are altered semantically on import unless you take special precautions--it should definitely be the other way round.
Best regards
Thanks Michael,
a short one..
ft:mark(//p[. contains text 'real'])
will not highlight anything if . contains mixed content with multiple text nodes, what is the best approach to highlight the results of a search, given a query and a matching node?
You may want to add the "any word" option to search all specified words as single terms; e.g.:
let $terms := ( 'real' )
return ft:mark(//p[.//text() contains text { $terms } any word])
This may yield unexpected results for phrases, though.
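For example, with

let $terms := ( 'New York' )
return ft:mark(//p[.//text() contains text { $terms } any word])

every isolated occurrence of 'New' and of 'York' would be marked as well, not just the complete phrase ('New York' is just an illustrative value here).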
Hope this helps? Christian
Hi,
I come back to this thread after some time:
Quoting Christian Grün christian.gruen@gmail.com:
If I want to get whitespaces back, do I have to re-create the collection?
Yes; sorry for that. The database does not contain any information on chopped whitespaces, which is why you'll indeed have to reimport the documents.
Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The node ids will change if the documents include pure whitespace texts.
I see.
Maybe someone can give me a hint on how to solve this problem:
I have a collection (Text-DB) created with whitespace chopped. Users already worked with this collection, so I have a relatively large database (Collect-DB) consisting of 150,000 entries like this one:
<entry>
  <node>12345</node>
  <id>Ad0001</id>
  <query>contains abcd</query>
</entry>
The "node" element contains the node-id from Text-DB where a certain xquery matched. The relevant nodes are paragraphs or lines from a TEI-document. I use the node-id and the query (as stored in the "query" element) in a later processing step to show the user the node with the relevant part by applying the original query to the original node using ft:mark.
When I re-create the collection with whitespace chopping turned off, preserving the sequence of documents as in the whitespace-chopped collection, the stored node-ids from Collect-DB would refer to completely different nodes. There is no way I could convince the users to do all the work again.
So my idea was to have the original Text-DB (without whitespace) and the new Text-DB (with whitespace), let's call it Text-DB-WS. All nodes in Text-DB have corresponding nodes in Text-DB-WS; they only differ in their node-ids. So I should be able to detect which node-id of Text-DB corresponds to which node-id of Text-DB-WS. And then I could create a new version of Collect-DB by replacing the value of all "node" elements with the respective node-id from Text-DB-WS.
Could this be done using BaseX or should I rather do some Perl-scripting?
Best regards
Cerstin -- Dr. phil. Cerstin Mahlow
Universität Basel Departement Sprach- und Literaturwissenschaften Fachbereich Deutsche Sprach- und Literaturwissenschaft Nadelberg 4 4051 Basel Schweiz
Tel: +41 61 267 07 65 Fax: +41 61 267 34 40 Mail: cerstin.mahlow@unibas.ch Web: http://www.oldphras.net
Hi Cerstin,
[…] So my idea was to have the original Text-DB (without whitespace) and the new Text-DB (with whitespace), let's call it Text-DB-WS. All nodes in Text-DB have corresponding nodes in Text-DB-WS; they only differ in their node-ids. So I should be able to detect which node-id of Text-DB corresponds to which node-id of Text-DB-WS. And then I could create a new version of Collect-DB by replacing the value of all "node" elements with the respective node-id from Text-DB-WS.
Could this be done using BaseX or should I rather do some Perl-scripting?
a straightforward solution could look as follows:
_________________________
declare option output:separator '\n';
declare variable $texts1 := db:open('Text-DB')//text();
declare variable $texts2 := db:open('Text-DB-WS')//text();

for $text1 in $texts1
let $str1 := normalize-space($text1)
let $id1  := db:node-id($text1)
return $id1 || ': ' || string-join(
  for $text2 in $texts2
  where $str1 = normalize-space($text2)
  return string(db:node-id($text2))
, ',')
_________________________
The query retrieves all text nodes of the two databases. In a nested loop, all strings are compared against each other, and the resulting output will list the ids of the text nodes of the first database, followed by the ids of matching texts of the second one:
3: 4,13
5: 7
7: 10
9: 4,13
If the database is too large, however, this approach may be too slow due to its O(n²) runtime. In that case, XQuery maps or the "group by" statement could probably be used to reduce the number of comparisons.
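If maps are available in your BaseX version, a rough sketch of that variant could look like this (map:entry and map:merge are the standard XQuery map functions; the texts of the second database are indexed once by their normalized string, so the nested loop disappears):

declare option output:separator '\n';

(: index the texts of the second database by their normalized string value :)
let $index := map:merge(
  for $text2 in db:open('Text-DB-WS')//text()
  group by $key := normalize-space($text2)
  return map:entry($key, $text2 ! db:node-id(.))
)
for $text1 in db:open('Text-DB')//text()
return db:node-id($text1) || ': ' ||
  string-join($index(normalize-space($text1)) ! string(.), ',')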
I hope this serves as a first inspiration, Christian
Hi,
On 2012-06-25, Cerstin Mahlow cerstin.mahlow@unibas.ch wrote:
So my idea was to have the original Text-DB (without whitespace) and the new Text-DB (with whitespace), let's call it Text-DB-WS. All nodes in Text-DB have corresponding nodes in Text-DB-WS; they only differ in their node-ids. So I should be able to detect which node-id of Text-DB corresponds to which node-id of Text-DB-WS. And then I could create a new version of Collect-DB by replacing the value of all "node" elements with the respective node-id from Text-DB-WS.
I think this is doable. As you're only interested in *element* nodes (<p> and <l>), we can be certain that any node in Text-DB is also in Text-DB-WS, and that the path to a particular node in both databases is identical.
Here's my go at it. For simplicity, the variable $nodes contains the information that would actually come from Collect-DB.
--8<---------------cut here---------------start------------->8---
xquery version "3.0";

declare option output:separator '\n';

declare variable $bad := db:open('Text-DB');
declare variable $nodes := <nodes><id>499</id><id>713</id></nodes>;

for $id in $nodes//id
let $path := replace(db:open-id($bad, $id)/path(), 'Q{.*?}', '*:')
return $id || ' → ' ||
  xquery:eval('db:node-id(db:open("Text-DB-WS")' || $path || ')')
--8<---------------cut here---------------end--------------->8---
Apparently the return value from path() is not a valid XPath expression; as a workaround I simply replace the "Q{...}" namespace stuff with "*:". But I'm not an XQuery hacker, so there's probably a better way... In any case, the above code works on my test database.
HTH and greetings
To complement this: while not completely made public yet (the next W3C working drafts are to be expected soon), the syntax returned by fn:path() is actually a valid XPath 3.0 expression; see [1] for more details.
Christian
[1] http://docs.basex.org/wiki/XQuery_3.0#Expanded_QNames
Christian,
On 2012-06-27, Christian Grün christian.gruen@gmail.com wrote:
To complement this: while not completely made public yet (the next W3 working drafts are to be expected soon), the syntax returned by fn:path() is actually a valid XPath 3.0 expression; see [1] for more details.
Thanks for the clarification. The example given in the wiki
Q{http://www.w3.org/2005/xpath-functions/math}pi()
works, but paths returned by path() don't work for me, e.g.,
/Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}t...]
or, for that matter,
/Q{http://www.tei-c.org/ns/1.0}TEI
they neither raise an error nor match anything.
For the same database,
declare namespace tei = "http://www.tei-c.org/ns/1.0";
/tei:TEI
works as expected.
Bug?
Best regards
Hi,
Quoting Michael Piotrowski mxp@cl.uzh.ch:
As you're only interested in *element* nodes (<p> and <l>), we can be certain that any node in Text-DB is also in Text-DB-WS, and that the path to a particular node in both databases is identical.
Thanks for your code! As my collection consists of different documents, I had to include the document URI; otherwise the paths are ambiguous:
declare option output:separator '\n';
for $id in //entry/node/data()
let $path := replace(db:open-id('Digibib-DTA-fuzzy', $id)/path(), 'Q{.*?}', '*:')
let $base := replace(base-uri(db:open-id('Digibib-DTA-fuzzy', $id)), 'fuzzy', 'fuzzy-ws')
return $id || ': ' ||
  xquery:eval(concat('db:node-id(doc("', $base, '")', $path, ')'))
The replacement in $base changes the document URI of the original collection to the new one. $id is extracted from Collect-DB. Is there another way to get the complete path to the node without concatenating base-uri() and path(), avoiding the eval construction?
It seems to work fine, I can create pairs of old and new node-ids for a test collect-DB with 15 entries.
For the entire Collect-DB, when returning only $path and/or $base instead of executing the eval construction, everything runs smoothly and takes around 74,000 ms in the current 7.3.1 GUI for 108,000 ids, which is OK, I think.
However, I get this in the console where I started basexgui:
2012-06-27 11:50:58.505 java[9887:1707] __CFServiceControllerBeginPBSLoadForLocalizations timed out while talking to pbs
What does this mean?
If I run the whole query (i.e., with the eval-construction), the GUI crashes and in the console this appears:
/opt/basex/bin/basexgui: line 32: 9907 Segmentation fault java -cp "$CP" $VM "${vm_args[@]}" org.basex.BaseXGUI "${general_args[@]}"
Any suggestions what this means and how to fix it?
Here is some more info from the Apple crash report ("Fehlerbericht"); I can send more if needed.
Process:         java [9516]
Path:            /usr/bin/java
Identifier:      com.apple.javajdk16.cmd
Version:         1.0 (1.0)
Code Type:       X86-64 (Native)
Parent Process:  bash [9511]

Date/Time:       2012-06-26 23:45:42.008 +0200
OS Version:      Mac OS X 10.6.8 (10K549)
Report Version:  6

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00000000000000b8
Crashed Thread:  6  Java: VM Thread
Best regards
Cerstin
Hi Cerstin,
I'm sorry, but these bugs all seem to be related to Java, in particular the OS X versions of Java, and not to BaseX itself, which is why we can't do anything here.
Christian
Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
I'm sorry, but these bugs all seem to be related to Java, in particular the OS X versions of Java, and not to BaseX itself, which is why we can't do anything here.
Strange. I get the first message "timed out while talking to pbs" for almost every interaction in the GUI. This is new, I didn't get this with former versions. Something must have changed.
However, I could run the query in the GUI on the Linux server. Creating node-id pairs for 108,359 ids took 9,047,211 ms, that's two and a half hours. So actually replacing the node-ids in my Collect-DB will probably take even longer ...
Best regards
Cerstin
Strange. I get the first message "timed out while talking to pbs" for almost every interaction in the GUI. This is new, I didn't get this with former versions. Something must have changed.
Someone else out there getting this behavior? As long as we cannot reproduce this locally, it's very difficult to fix for us. What you can try…
– run different Java versions
– do some search on the returned error messages to get a better feeling if this bug is currently being fixed, or has already been fixed, by the Java developers
– try a different snapshot of BaseX such that we can further isolate the issue (ideally, by checking out the GitHub sources and finding the commit that has potentially caused the new problems).
Quoting Christian Grün christian.gruen@gmail.com:
Strange. I get the first message "timed out while talking to pbs" for almost every interaction in the GUI. This is new, I didn't get this with former versions. Something must have changed.
Someone else out there getting this behavior? As long as we cannot reproduce this locally, it's very difficult to fix for us. What you can try…
– run different Java versions
– do some search on the returned error messages to get a better feeling if this bug is currently being fixed, or has already been fixed, by the Java developers
I could isolate the problem, which persists even after updating the Java version. As soon as I copy something to the clipboard, the message appears. The same is reported for other Java applications like Eclipse. So this problem has nothing to do with BaseX. However, I am not sure where the crashing comes from.
Cerstin