Hi,
In "XQuery and XPath Full Text 1.0 Use Cases" [1] it says:
Querying across element boundaries is similar to an XQuery and XPath character string function converting the sub-tree under an element into a string by removing all markup.
However, I'm having trouble getting this to work in BaseX 7.2.1. For example, given this document:
<doc>
  <p id="1">Lorem ipsum dolor sit amet</p>
  <p id="2">Lorem ipsum dolor sit amet</p>
  <p id="3">Lorem ipsum dolor <i>sit</i> amet</p>
  <p id="4">Lorem ipsum <i>dolor sit</i> amet</p>
</doc>
How do I query, for example, for p elements that contain the string "sit amet"?
//p[. contains text 'sit amet']
returns only paragraphs 1 and 2. I've tried a number of variations, but I've been unable to come up with a query that returns all four p elements.
Are my queries incorrect or is this a bug in BaseX?
Thanks and best regards
Footnotes: [1] http://www.w3.org/TR/xpath-full-text-10-use-cases/
Dear Michael,
to get the requested result, you need to deactivate the chopping of whitespaces (via SET CHOP OFF, or Dialog → New… → Parsing → Chop Whitespaces).
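For example, on the command line (just a sketch; the database and file names are placeholders):

SET CHOP OFF
CREATE DB example doc.xml
XQUERY //p[. contains text 'sit amet']

With chopping disabled, all four p elements should then be returned.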
Hope this helps, Christian
-- Dr.-Ing. Michael Piotrowski, M.A. mxp@cl.uzh.ch Institute of Computational Linguistics, University of Zurich Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
- OUT NOW: Systems and Frameworks for Computational Morphology
- http://www.springeronline.com/978-3-642-23137-7
Hi Christian,
On 2012-05-08, Christian Grün christian.gruen@gmail.com wrote:
to get the requested result, you need to deactivate the chopping of whitespaces (via SET CHOP OFF, or Dialog → New… → Parsing → Chop Whitespaces).
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
Another question: The examples in the documentation and the actual behavior of BaseX suggest that ft:extract and ft:mark only work for queries on a single text node, i.e., queries such as
//p[text() contains text 'sit amet']
but not for
//p[. contains text 'sit amet']
Is this correct?
Thanks and greetings
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
Thanks; your edits are welcome.
Another question: The examples in the documentation and the actual behavior of BaseX suggest that ft:extract and ft:mark only work for queries on a single text node, [...]
Yes, that's true. In terms of the syntax, a dot (.) may work as well, if the addressed elements have no more descendant elements, as the following example shows:
x.xml: <x>A</x>
query: ft:mark(doc('x.xml')//x[. contains text 'a'])
Hope this helps, Christian
On 2012-05-08, Christian Grün christian.gruen@gmail.com wrote:
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
Thanks; your edits are welcome.
Ok, I'll add this then.
Another question: The examples in the documentation and the actual behavior of BaseX suggest that ft:extract and ft:mark only work for queries on a single text node, [...]
Yes, that's true. In terms of the syntax, a dot (.) may work as well, if the addressed elements have no more descendant elements, as the following example shows:
x.xml: <x>A</x>
query: ft:mark(doc('x.xml')//x[. contains text 'a'])
Thanks for the information, that's good to know. I think I'll file an enhancement request then: for text-oriented applications (e.g., TEI documents), it would be extremely useful if ft:mark worked with descendant elements; typically you have lots of mixed content, with elements containing rendering information or annotations, such as <hi>, <orig>, <corr>, <persName>, <placeName>, <handShift>, etc.; these elements don't interrupt the logical text flow.
Best regards
Thanks for the information, that's good to know. I think I'll file an enhancement request then: for text-oriented applications (e.g., TEI documents), it would be extremely useful if ft:mark worked with descendant elements; typically you have lots of mixed content, with elements containing rendering information or annotations, such as <hi>, <orig>, <corr>, <persName>, <placeName>, <handShift>, etc.; these elements don't interrupt the logical text flow.
While I concede this may be useful in numerous use cases (and may even seem obvious), it would take quite some time to get implemented, so... please don't expect too much magic for the moment. There will also be some conceptual issues that need to be resolved. As an example, which result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
If you don't need the inner elements, you may as well remove them from your document before applying ft:mark(). This is demonstrated by the following example:
copy $x := document { <p>This is a <i>simple</i> test.</p> }
modify
  for $t in $x//*[*][text()]
  return replace value of node $t with data($t)
return ft:mark($x//*[text() contains text 'test'])
If speed is a top priority, you may as well create a new database from all updated source files and build a full text index for the new database...
for $i in collection('...')
return
  copy $x := $i
  modify
    for $t in $x//*[*][text()]
    return replace value of node $t with data($t)
  return db:add(...)
Christian
Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
Thanks for the information, that's good to know. I think I'll file an enhancement request then: for text-oriented applications (e.g., TEI documents), it would be extremely useful if ft:mark worked with descendant elements; typically you have lots of mixed content, with elements containing rendering information or annotations, such as <hi>, <orig>, <corr>, <persName>, <placeName>, <handShift>, etc.; these elements don't interrupt the logical text flow.
While I concede this may be useful in numerous use cases (and may even seem obvious), it would take quite some time to get implemented, so... please don't expect too much magic for the moment. There will also be some conceptual issues that need to be resolved. As an example, which result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
I think it should be
<a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
Each token from the search string would be enclosed in a <mark>-element.
I once asked for continuous marking, i.e., only one opening and one closing mark tag for a search string consisting of several words. This would not be suitable for elements where I expect the hit to cover inner elements, because of the overlapping markup. But this might be solved with an option for ft:mark, to apply it either continuously or for each token found.
If you don't need the inner elements, you may as well remove them from your document before applying ft:mark().
This is a great idea if you would like to know whether the search elements are somewhere in your text.
However, if you would like to show the results to end users (= humanities people) or to annotate the document further, it's not a good idea to destroy the original structure. Or maybe one would have to come up with some tricky workaround to first replace the hierarchical node with a flat one for searching, then annotate something and somehow replace the original hierarchical one with the annotated one preserving the original hierarchy.
And for searching only, the scenario is a TEI-document representing an old printed book with highlighting (e.g., some things in italics), foreign-language words printed in a different font, person names already marked, etc. The TEI rendering is intended to mimic the original printed page. When implementing a full-text search, the end user expects to see the highlighted search tokens within the rendered page. Therefore the "easiest" way is to search in descendant nodes and use ft:mark to highlight the hits, without any need to change the TEI rendering. This would also allow the end user to not only see the node where the search string was found, but scroll up and down to inspect the context of the node.
But maybe for searching and displaying the results in the original document, one would have to develop a bigger application.
Best regards
Cerstin
On 2012-05-09, Cerstin Mahlow cerstin.mahlow@unibas.ch wrote:
While I concede this may be useful in numerous use cases (and may even seem obvious), it would take quite some time to get implemented, so... please don't expect too much magic for the moment. There will also be some conceptual issues that need to be resolved. As an example, which result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
I think it should be
<a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
Each token from the search string would be enclosed in a <mark>-element.
Exactly. While this probably wouldn't cover *all* possible scenarios, it would still cover most of the useful ones. In fact, it would be similar to http://www.raymondhill.net/blog/?p=272. It would also be applicable when ignoring elements in a search.
For complex applications it may help to get the start and end character positions of the matches (essentially standoff markup), and the application could then do the highlighting itself on the basis of this information.
[...]
If you don't need the inner elements, you may as well remove them from your document before applying ft:mark().
This is a great idea if you would like to know whether the search elements are somewhere in your text.
However, if you would like to show the results to end users (= humanities people) or to annotate the document further, it's not a good idea to destroy the original structure. Or maybe one would have to come up with some tricky workaround to first replace the hierarchical node with a flat one for searching, then annotate something and somehow replace the original hierarchical one with the annotated one preserving the original hierarchy.
And for searching only, the scenario is a TEI-document representing an old printed book with highlighting (e.g., some things in italics), foreign-language words printed in a different font, person names already marked, etc. The TEI rendering is intended to mimic the original printed page. When implementing a full-text search, the end user expects to see the highlighted search tokens within the rendered page. Therefore the "easiest" way is to search in descendant nodes and use ft:mark to highlight the hits, without any need to change the TEI rendering. This would also allow the end user to not only see the node where the search string was found, but scroll up and down to inspect the context of the node.
I fully agree, this is exactly what I need in my application: I don't want to retrieve snippets from the document, but I always have to display the full document with the hits highlighted.
What I'm going to do now is probably highlight the full paragraph which contains the node retrieved by the search, i.e., get the node ID, walk up the tree until I encounter a <p> and get its @xml:id, which I can then use in a CSS stylesheet. Or something like this. But this is clearly only an approximation.
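Something along these lines, I suppose (just a sketch; it assumes the matched node is bound to $hit and that the enclosing paragraphs carry @xml:id attributes):

declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare variable $hit external;  (: a node returned by the search :)

(: climb to the nearest enclosing <p> and return its xml:id for the stylesheet :)
string($hit/ancestor-or-self::tei:p[1]/@xml:id)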
Best regards
Thanks, Cerstin and Michael, for your suggestions.
Yes, all the use cases sound perfectly reasonable to me. When it comes to the implementation, I see quite a number of obstacles to making this happen. One of the reasons is that the full-text expression, as currently implemented, discards all elements before tokenizing the texts. This means that the following queries are basically the same:
<a>X <b>Y</b> Z</a>[. contains text 'X Y']
<a>X <b>Y</b> Z</a>[data() contains text 'X Y']
As Cerstin indicated, you'll probably have to parse all text nodes individually:
ft:mark(//*[text() contains text {'X', 'Y'}])
This simple approach, however, won't work out with phrases (multiple terms) that reach into descendant nodes.
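For example, applied to the <a>X <b>Y</b> Z</a> element from above, a query such as

//*[text() contains text 'X Y']

returns nothing, because none of the individual text nodes ("X ", "Y", " Z") contains the complete phrase.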
Christian
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
...thanks again for editing our Wiki -- always welcome! I've added a slightly adapted version of your last paragraph to indicate that full-text tokenization always works on the string values of the elements:
http://docs.basex.org/wiki/Full-Text#Mixed_Content
Feel free to further polish it, Christian
yet another fulltext question
I'm looking for a performant way to look up full-text information in a collection of image descriptions.
The db / document contains fewer than 10,000 items. The structure is simple:
<bilder>
  <bild>
    <InterneId/>
    <titel/>
    <kuenstler>
      <person/>
    </kuenstler>
    <datierung/>
    <dargestellte_person/>
    ...
    <widmung></widmung>
    <beschreibung></beschreibung>
  </bild>
  <bild/>
  <bild/>
  ...
</bilder>
When trying to get the title ('titel') of all nodes that contain the search string 'mag', I get the following:
doc('bilder')/bilder/bild[text() contains text 'mag']/titel
-> no result

doc('bilder')/bilder/bild[. contains text 'mag']/titel
-> 18 hits, query time about 950 ms

The following query produces the same result but is much faster:

ft:search(doc('bilder'), 'mag')/ancestor::*[local-name(.) = 'bild']/titel
-> 18 hits, query time less than 3 ms
Can I say that one should prefer ft:search and then climbing the ancestor axis over 'contains text'? The documentation says that 'contains text' is processed using the full-text index. If so, how can this performance difference be explained?
TIA
-- Manfred Knobloch - Medientechnik -
Leibniz-Institut für Wissensmedien IWM - Knowledge Media Research Center (KMRC) Schleichstraße 6 | 72076 Tübingen Fon: +49 7071 979-340 Internet: http://www.iwm-kmrc.de/m.knobloch
On 2012-05-09, Christian Grün christian.gruen@gmail.com wrote:
Ah, thanks a lot! I would have never guessed this... Maybe the documentation should say something like: "Querying across elements is only supported when whitespace chopping is off." If it's ok with you, I'll add it.
...thanks again for editing our Wiki -- always welcome! I've added a slightly adapted version of your last paragraph to indicate that full-text tokenization always works on the string values of the elements:
Thanks for the clarification. Thank you also for creating the ticket for ft:mark.
Frankly, I find it quite dangerous that CHOP is ON by default. Discarding whitespace in mixed content means losing information. I'd find it preferable if it were off by default; if you know your data and if you are aware of the effects of CHOP, *then* you could turn it on.
Best regards
On 2012-05-11 23:34, Michael Piotrowski wrote:
Frankly, I find it quite dangerous that CHOP is ON by default. Discarding whitespace in mixed content means losing information. I'd find it preferable if it were off by default; if you know your data and if you are aware of the effects of CHOP, *then* you could turn it on.
+1
Whitespace should always be preserved, unless the parser strips it because it knows that it’s ignorable whitespace (because it has been made aware of a schema?).
Looking at Saxon’s strip option, http://www.saxonica.com/documentation/javadoc/net/sf/saxon/s9api/WhitespaceS...
Saxon’s -strip:ignorable has a certain appeal, but when you consider how “ignorable” is specified:
The value IGNORABLE indicates that whitespace text nodes in element-only content are discarded.
fringe cases instantly come to mind where it will strip one whitespace too many:
<p><hi rend="bold">Hello</hi> <hi rend="bolditalic">World</hi></p>
“element-only content”: I’m not sure whether Saxon decides that it is in element-only content based upon information from the parser (“no text node allowed here”) or whether it draws its own conclusions (“no text node present here”).
Hmm.
On 2012-05-12 00:09, Imsieke, Gerrit, le-tex wrote:
I’m not sure whether Saxon decides that it is in element-only content based upon information from the parser (“no text node allowed here”) or whether it draws its own conclusions (“no text node present here”).
s/no text node/no non-WS text node/g;
Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
to get the requested result, you need to deactivate the chopping of whitespaces (via SET CHOP OFF, or Dialog → New… → Parsing → Chop Whitespaces).
I just ran into a related issue:
When creating my collection, I did not deactivate whitespace chopping. This results in:
<p rend="zenoPLm4n0" xml:id="tg5.2.22">Das ich aber Eur<hi rend="italic" xml:id="tg5.2.22.1">Excellenz,</hi>Hochwürden vnd Gnaden dises wintzige Werckel demütigst zuschreibe /hab ich ein sehr fügliche Ursach / weil ich nemlich dises kleine Tractätl habe zusammen getragen in der stattlichen Behausung Ihro Hochgräfflichen<hi rend="italic" xml:id="tg5.2.22.2">Excellenz</hi>Herrn Hanß Balthasar Graffen von<hi rend="italic" xml:id="tg5.2.22.3">Hojos</hi>der Zeit wertisten Landmarschall vnnd geheimen<hi rend="italic" xml:id="tg5.2.22.4">Deputir</hi>ten Rath ...</p>
instead of:
<p rend="zenoPLm4n0" xml:id="tg5.2.22">Das ich aber Eur <hi rend="italic" xml:id="tg5.2.22.1">Excellenz,</hi> Hochwürden vnd Gnaden dises wintzige Werckel demütigst zuschreibe /hab ich ein sehr fügliche Ursach / weil ich nemlich dises kleine Tractätl habe zusammen getragen in der stattlichen Behausung Ihro Hochgräfflichen <hi rend="italic" xml:id="tg5.2.22.2">Excellenz</hi> Herrn Hanß Balthasar Graffen von <hi rend="italic" xml:id="tg5.2.22.3">Hojos</hi> der Zeit wertisten Landmarschall vnnd geheimen <hi rend="italic" xml:id="tg5.2.22.4">Deputir</hi>ten Rath ...</p>
When the user wants to annotate something, he selects the respective text with the mouse, presses the "mark" button, and saves the annotation. Later he might want to review it, or whatever. So I store the node-id and the marked part to later be able to use ft:mark to show the user what he selected.
One thing that is a bit ugly is that the person "Herrn Hanß Balthasar Graffen von Hojos" is displayed as "Herrn Hanß Balthasar Graffen vonHojos", because of the missing whitespace. However, I can choose to save either
"Herrn Hanß Balthasar Graffen vonHojos" (the data of the selected part) or
"Herrn Hanß Balthasar Graffen von<hi rend="italic" xml:id="tg5.2.22.3">Hojos" (the exact part of the node)
But I have no idea how to make this selection visible in a later step.
The first question:
If I want to get whitespaces back, do I have to re-create the collection? Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The second question:
How would I display the selected text snippet to the user, when I store the node-id and the text (as mixed content)? ft:mark will not work, I think. I cannot store the whole node with the new markup, because in the end I will have different annotations for the same node (so I would store various versions of this node) and I don't see a clever way to collapse different versions of a node keeping all annotations to replace the original node.
Best regards
Cerstin
Hi Cerstin,
If I want to get whitespaces back, do I have to re-create the collection?
Yes; sorry for that. The database does not contain any information on chopped whitespaces, which is why you'll indeed have to reimport the documents.
Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The node ids will change if the documents include pure whitespace texts. The following example represents such a document; it contains three text nodes ("X", and two text nodes with a single newline character):
<hello>
<world>X</world>
</hello>
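To see what happens, you could list the node ids of such a database (a quick sketch; it assumes the document has been added to a database called 'test'):

for $node in db:open('test')//node()
return db:node-id($node) || ': ' ||
  (if ($node instance of text()) then 'text("' || $node || '")' else name($node))

With chopping disabled, the whitespace-only text nodes get ids of their own, so the ids of all following nodes are shifted compared to a chopped database.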
How would I display the selected text snippet to the user, when I store the node-id and the text (as mixed content)? ft:mark will not work, I think.
I'm not quite sure what you refer to here; could you attach a small example? Christian
PS@Michael and Gerrit: thanks for your opinion. One of the reasons for chopping whitespace by default is that whitespace texts in structured documents consume a lot of space in a database, although they will never need to be processed. However, I see that this solution may cause more confusion than it helps, which is why we'll think about switching the default behavior.
On 2012-05-13, Christian Grün christian.gruen@gmail.com wrote:
If I want to get whitespaces back, do I have to re-create the collection?
Yes; sorry for that. The database does not contain any information on chopped whitespaces, which is why you'll indeed have to reimport the documents.
Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The node ids will change if the documents include pure whitespace texts. The following example represents such a document; it contains three text nodes ("X", and two text nodes with a single newline character):
<hello>
<world>X</world>
</hello>
I'll be working with Cerstin on this issue, so here's a brief comment. Thanks for the example, that's what I feared ... I think we're lucky that we're only dealing with node IDs of elements, so we can annotate the elements with ID attributes, associate the node IDs with the XML IDs, and then translate them again to the node IDs of the "unchopped" database. If we were dealing with node IDs of text nodes, we'd be hosed ...
How would I display the selected text snippet to the user, when I store the node-id and the text (as mixed content)? ft:mark will not work, I think.
I'm not quite sure what you refer to here; could you attach a small example? Christian
I *think* what she means is: Since
ft:mark(//p[. contains text 'real'])
will not highlight anything if . contains mixed content with multiple text nodes, what is the best approach to highlight the results of a search, given a query and a matching node?
PS@Michael and Gerrit: thanks for your opinion. One of the reasons for chopping whitespace by default is that whitespace texts in structured documents consume a lot of space in a database, although they will never need to be processed.
Yes, I figured that it was intended for data-oriented documents.
However, I see that this solution may cause more confusion than it helps, which is why we'll think about switching the default behavior.
This would be very welcome! Your example above also nicely illustrates the problem. As the significance of whitespace in XML can only be determined when there's a schema, chopping whitespace by default means that, strictly speaking, documents are altered semantically on import unless you take special precautions--it should definitely be the other way round.
Best regards
Thanks Michael,
a short one..
ft:mark(//p[. contains text 'real'])
will not highlight anything if . contains mixed content with multiple text nodes, what is the best approach to highlight the results of a search, given a query and a matching node?
You may want to add the "any word" option to search all specified words as single terms; e.g.:
let $terms := ( 'real' )
return ft:mark(//p[.//text() contains text { $terms } any word])
This may yield unexpected results for phrases, though.
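For example, with

let $terms := ( 'New York' )
return ft:mark(//p[.//text() contains text { $terms } any word])

every isolated occurrence of 'New' and of 'York' would be marked as well, not just the complete phrase ('New York' is just an illustrative value here).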
Hope this helps? Christian
Hi,
I come back to this thread after some time:
Quoting Christian Grün christian.gruen@gmail.com:
If I want to get whitespaces back, do I have to re-create the collection?
Yes; sorry for that. The database does not contain any information on chopped whitespaces, which is why you'll indeed have to reimport the documents.
Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The node ids will change if the documents include pure whitespace texts.
I see.
Maybe someone can give me a hint on how to solve this problem:
I have a collection (Text-DB) created with whitespace chopped. Users already worked with this collection, so I have a relatively large database (Collect-DB) consisting of 150,000 entries like this one:
<entry>
  <node>12345</node>
  <id>Ad0001</id>
  <query>contains abcd</query>
</entry>
The "node" element contains the node-id from Text-DB where a certain xquery matched. The relevant nodes are paragraphs or lines from a TEI-document. I use the node-id and the query (as stored in the "query" element) in a later processing step to show the user the node with the relevant part by applying the original query to the original node using ft:mark.
When I re-create the collection with whitespace chopping turned off, preserving the sequence of documents as in the whitespace-chopped collection, the stored node-ids from Collect-DB would refer to completely different nodes. There is no way I could convince the users to do all the work again.
So my idea was to have the original Text-DB (without whitespace) and the new Text-DB (with whitespace), let's call it Text-DB-WS. All nodes in Text-DB have corresponding nodes in Text-DB-WS; they only differ in their node-ids. So I should be able to detect which node-id of Text-DB corresponds to which node-id of Text-DB-WS. And then I could create a new version of Collect-DB by replacing the value of all "node" elements with the respective node-id from Text-DB-WS.
Could this be done using BaseX or should I rather do some Perl-scripting?
Best regards
Cerstin -- Dr. phil. Cerstin Mahlow
Universität Basel Departement Sprach- und Literaturwissenschaften Fachbereich Deutsche Sprach- und Literaturwissenschaft Nadelberg 4 4051 Basel Schweiz
Tel: +41 61 267 07 65 Fax: +41 61 267 34 40 Mail: cerstin.mahlow@unibas.ch Web: http://www.oldphras.net
Hi Cerstin,
[…] So my idea was to have the original Text-DB (without whitespace) and the new Text-DB (with whitespace), let's call it Text-DB-WS. All nodes in Text-DB have corresponding nodes in Text-DB-WS; they only differ in their node-ids. So I should be able to detect which node-id of Text-DB corresponds to which node-id of Text-DB-WS. And then I could create a new version of Collect-DB by replacing the value of all "node" elements with the respective node-id from Text-DB-WS.
Could this be done using BaseX or should I rather do some Perl-scripting?
a straightforward solution could look as follows:
_________________________
declare option output:separator '\n';
declare variable $texts1 := db:open('Text-DB')//text();
declare variable $texts2 := db:open('Text-DB-WS')//text();

for $text1 in $texts1
let $str1 := normalize-space($text1)
let $id1  := db:node-id($text1)
return $id1 || ': ' || string-join(
  for $text2 in $texts2
  where $str1 = normalize-space($text2)
  return string(db:node-id($text2))
, ',')
_________________________
The query retrieves all text nodes of the two databases. In a nested loop, all strings are compared against each other, and the resulting output will list the ids of the text nodes of the first database, followed by the ids of matching texts of the second one:
3: 4,13
5: 7
7: 10
9: 4,13
If the database is too large, however, this approach may be too slow due to its O(n²) runtime. In that case, XQuery maps or the "group by" statement could probably be used to reduce the number of comparisons.
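If maps are available in your BaseX version, a rough sketch of that variant could look like this (map:entry and map:merge are the standard XQuery map functions; the texts of the second database are indexed once by their normalized string, so the nested loop disappears):

declare option output:separator '\n';

(: index the texts of the second database by their normalized string value :)
let $index := map:merge(
  for $text2 in db:open('Text-DB-WS')//text()
  group by $key := normalize-space($text2)
  return map:entry($key, $text2 ! db:node-id(.))
)
for $text1 in db:open('Text-DB')//text()
return db:node-id($text1) || ': ' ||
  string-join($index(normalize-space($text1)) ! string(.), ',')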
I hope this serves as a first inspiration, Christian
Hi,
On 2012-06-25, Cerstin Mahlow cerstin.mahlow@unibas.ch wrote:
So my idea was to have the original Text-DB (without whitespace) and the new Text-DB (with whitespace), let's call it Text-DB-WS. All nodes in Text-DB have corresponding nodes in Text-DB-WS; they only differ in their node-ids. So I should be able to detect which node-id of Text-DB corresponds to which node-id of Text-DB-WS. And then I could create a new version of Collect-DB by replacing the value of all "node" elements with the respective node-id from Text-DB-WS.
I think this is doable. As you're only interested in *element* nodes (<p> and <l>), we can be certain that any node in Text-DB is also in Text-DB-WS, and that the path to a particular node in both databases is identical.
Here's my go at it. For simplicity, the variable $nodes contains the information that would actually come from Collect-DB.
--8<---------------cut here---------------start------------->8---
xquery version "3.0";

declare option output:separator '\n';

declare variable $bad := db:open('Text-DB');
declare variable $nodes := <nodes><id>499</id><id>713</id></nodes>;

for $id in $nodes//id
let $path := replace(db:open-id($bad, $id)/path(), 'Q{.*?}', '*:')
return $id || ' → ' ||
  xquery:eval('db:node-id(db:open("Text-DB-WS")' || $path || ')')
--8<---------------cut here---------------end--------------->8---
Apparently the return value from path() is not a valid XPath expression; as a workaround I simply replace the "Q{...}" namespace stuff with "*:". But I'm not an XQuery hacker, so there's probably a better way... In any case, the above code works on my test database.
HTH and greetings
To complement this: while not completely made public yet (the next W3C working drafts are to be expected soon), the syntax returned by fn:path() is actually a valid XPath 3.0 expression; see [1] for more details.
Christian
[1] http://docs.basex.org/wiki/XQuery_3.0#Expanded_QNames
Christian,
On 2012-06-27, Christian Grün christian.gruen@gmail.com wrote:
To complement this: while not completely made public yet (the next W3 working drafts are to be expected soon), the syntax returned by fn:path() is actually a valid XPath 3.0 expression; see [1] for more details.
Thanks for the clarification. The example given in the wiki
Q{http://www.w3.org/2005/xpath-functions/math}pi()
works, but paths returned by path() don't work for me, e.g.,
/Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}t...]
or, for that matter,
/Q{http://www.tei-c.org/ns/1.0}TEI
they neither raise an error nor match anything.
For the same database,
declare namespace tei = "http://www.tei-c.org/ns/1.0";
/tei:TEI
works as expected.
Bug?
Best regards
Hi,
Quoting Michael Piotrowski mxp@cl.uzh.ch:
As you're only interested in *element* nodes (<p> and <l>), we can be certain that any node in Text-DB is also in Text-DB-WS, and that the path to a particular node in both databases is identical.
Thanks for your code! As my collection consists of different documents, I had to include the document URI; otherwise the paths are ambiguous:
declare option output:separator '\n';
for $id in //entry/node/data()
let $path := replace(db:open-id('Digibib-DTA-fuzzy', $id)/path(), 'Q{.*?}', '*:')
let $base := replace(base-uri(db:open-id('Digibib-DTA-fuzzy', $id)), 'fuzzy', 'fuzzy-ws')
return $id || ': ' ||
  xquery:eval(concat('db:node-id(doc("', $base, '")', $path, ')'))
The replacement in $base changes the document URI of the original collection to the new one. $id is extracted from Collect-DB. Is there another way to get the complete path to the node without concatenating base-uri() and path(), avoiding the eval construction?
It seems to work fine, I can create pairs of old and new node-ids for a test collect-DB with 15 entries.
For the entire Collect-DB, when returning only $path and/or $base instead of executing the eval construction, everything runs smoothly and takes around 74,000 ms in the current 7.3.1 GUI for 108,000 ids, which is OK, I think.
However, I get this in the console where I started basexgui:
2012-06-27 11:50:58.505 java[9887:1707] __CFServiceControllerBeginPBSLoadForLocalizations timed out while talking to pbs
What does this mean?
If I run the whole query (i.e., with the eval-construction), the GUI crashes and in the console this appears:
/opt/basex/bin/basexgui: line 32: 9907 Segmentation fault java -cp "$CP" $VM "${vm_args[@]}" org.basex.BaseXGUI "${general_args[@]}"
Any suggestions what this means and how to fix it?
Here is some more info from the Apple crash report ("Fehlerbericht"); I can send more if needed.
Process:         java [9516]
Path:            /usr/bin/java
Identifier:      com.apple.javajdk16.cmd
Version:         1.0 (1.0)
Code Type:       X86-64 (Native)
Parent Process:  bash [9511]

Date/Time:       2012-06-26 23:45:42.008 +0200
OS Version:      Mac OS X 10.6.8 (10K549)
Report Version:  6

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00000000000000b8
Crashed Thread:  6  Java: VM Thread
Best regards
Cerstin
Hi Cerstin,
I'm sorry, but these bugs all seem to be related to Java, in particular the OS X versions of Java, and not to BaseX itself, which is why we can't do anything here.
Christian
Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
I'm sorry, but these bugs all seem to be related to Java, in particular the OS X versions of Java, and not to BaseX itself, which is why we can't do anything here.
Strange. I get the first message "timed out while talking to pbs" for almost every interaction in the GUI. This is new, I didn't get this with former versions. Something must have changed.
However, I could run the query in the GUI on the Linux server. Creating node-id pairs for 108,359 ids took 9,047,211 ms, that's two and a half hours. So actually replacing the node-ids in my Collect-DB will probably take even longer ...
Best regards
Cerstin
Strange. I get the first message "timed out while talking to pbs" for almost every interaction in the GUI. This is new, I didn't get this with former versions. Something must have changed.
Someone else out there getting this behavior? As long as we cannot reproduce this locally, it's very difficult to fix for us. What you can try…
– run different Java versions
– do some search on the returned error messages to get a better feeling if this bug is currently being fixed, or has already been fixed, by the Java developers
– try a different snapshot of BaseX such that we can further isolate the issue (ideally, by checking out the GitHub sources and finding the commit that has potentially caused the new problems).
Quoting Christian Grün christian.gruen@gmail.com:
Strange. I get the first message "timed out while talking to pbs" for almost every interaction in the GUI. This is new, I didn't get this with former versions. Something must have changed.
Someone else out there getting this behavior? As long as we cannot reproduce this locally, it's very difficult to fix for us. What you can try…
– run different Java versions
– do some search on the returned error messages to get a better feeling if this bug is currently being fixed, or has already been fixed, by the Java developers
I could isolate the problem, which persists even after updating the Java version. As soon as I copy something to the clipboard, the message appears. The same is reported for other Java applications like Eclipse. So this problem has nothing to do with BaseX. However, I am not sure where the crashing comes from.
Cerstin