Re: [basex-talk] Issue with Full Text Retrieval

6 Nov 2017

Hi Ron,

As far as I can tell, the only difference is that the first looks
specifically in the condition field (<condition>BRCA Mutated</condition>)
while the latter queries the entire XML (the individual text nodes), as
Fabrice suggested.

I haven’t run your query, but I wanted to point out that your second query
won’t query the individual text nodes; instead, it will concatenate all
texts and try to find the single token 'brca' inside that string.

Does this answer your question?
Christian

See the attached XML that is being matched (CT.gov trial NCT01853306).

Thanks,
Ron

Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
<http://www.mdsol.com/>
350 Hudson Street, 7th Floor, New York, NY 10014
rkatriel@mdsol.com <tbrophy@mdsol.com> | direct: +1 201 337 3622 | mobile: +1
201 675 5598 | main: +1 212 918 1800

On September 19, 2017 at 5:04:28 AM, Christian Grün (
christian.gruen@gmail.com) wrote:
...
Yes, this helps. By index rewritings, are you referring to the indices
created when FTINDEX is set to true?
Ditto.
...
Thanks,
Ron
On September 18, 2017 at 11:12:54 AM, Christian Grün
(christian.gruen@gmail.com) wrote:
Hi Ron,
With mixed-content, it can be beneficial if element boundaries are
ignored. An example:
<xml><b>H</b>ello world!</xml>
contains text 'hello'
If you set the CHOP option to false before creating a database,
whitespaces will be included in your database. As Fabrice has pointed
out, however, it is usually better to directly address the text nodes
of your database; otherwise, you won’t be able to benefit from the
index rewritings.
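A minimal sketch of the difference, using the mixed-content element from above (the expected values in the comments assume BaseX's default full-text tokenization and are not taken from an actual run):

```xquery
let $node := <xml><b>H</b>ello world!</xml>
return (
  (: element as a whole: its string value "Hello world!" is tokenized :)
  $node contains text 'hello',          (: expected: true :)
  (: text nodes separately: "H" and "ello world!" are tokenized on their own :)
  $node//text() contains text 'hello'   (: expected: false :)
)
```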
Hope this helps,
Christian
On Mon, Sep 11, 2017 at 4:59 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
...
Thanks Fabrice and Michael. Solution (1) works great!
A parting question: why not make the default behavior when querying the
textual representation of a document to not “chop” away critical word
boundary delimiters? So, in the example below it would return
XQuery
and XPAth are awesome
The munging together of "XPAth" and "are" seems counterintuitive to me.
Best,
Ron
On September 11, 2017 at 4:13:54 AM, Michael Seiferle (ms@basex.org)
wrote:
Hi Ron,
Hi Fabrice,
Your observation w.r.t. element boundaries is right: the document is
converted to a textual representation. By default, it returns all nodes
in their string representation:
$doc :=
<doc>
XQuery
<_>and XPAth</_>
<_>are awesome</_>
</doc>/data()
Will turn to:
XQuery
and XPAthare awesome
So:
$doc contains text { 'XPath' }
will return false.
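To illustrate, a small sketch comparing the string value with the individual text nodes (the inline element is a condensed, hypothetical variant of the example above; the values in the comments are the expected results under default options, not the output of a real run):

```xquery
let $doc := <doc>XQuery <_>and XPAth</_><_>are awesome</_></doc>
return (
  (: string value of the element: adjacent texts are joined without a separator :)
  string($doc),                        (: "XQuery and XPAthare awesome" :)
  (: the joined string contains the token "xpathare", not "xpath" :)
  $doc contains text 'XPath',          (: expected: false :)
  (: each text node is tokenized on its own, so "and XPAth" matches :)
  $doc//text() contains text 'XPath'   (: expected: true :)
)
```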
You have 3.5 options:
1) => as Fabrice showed, query the individual text nodes
2) use the ft:search() Function to query the index directly,
http://docs.basex.org/wiki/Full-Text_Module#ft:search
ft:search(
'CTGovDebug',
'neoplasms'
)/.. (: get parent element for the matching text()-node :)
3) disable chopping when creating the database,
http://docs.basex.org/wiki/Options#XML_Parsing
db:create(
'CTGovDebug',
"Path/to/NCT00473512.xml",
"NCT00473512.xml",
map {
'ftindex': true(),
'chop': false()
})
3.5) use the xml:space="preserve" attribute to tell the parser not to chop
child nodes of <clinical_study/> when creating a database:
<clinical_study xml:space="preserve">
<!-- This xml conforms to an XML Schema at:
https://clinicaltrials.gov/ct2/html/images/info/public.xsd
-->
<required_header>
<download_date>ClinicalTrials.gov processed this data on August 31,
2017</download_date>
<link_text>Link to the current ClinicalTrials.gov record.</link_text>
Hope this helped shed some light :-)
Best from Konstanz
Michael
--
Michael Seiferle, BaseX GmbH,
http://www.basexgmbh.de
|-- Registered office: Obere Laube 73, 78462 Konstanz
|-- Register court Freiburg, HRB: 708285, Managing directors:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Tel: +49 7531 916 82 77
On 11.09.2017 at 09:35, Fabrice ETANCHAUD
<fetanchaud@pch.cerfrance.fr> wrote:
Hello Ron,
I don’t know how ft operators behave on document nodes.
Supposing documents are converted to their data() representation, your query
would yield the same negative answer.
You should consider applying ft operators on text nodes like this:
for $trial in db:open('NCT00473512')//text() (:
[clinical_study/id_info/nct_id='NCT00473512'] :)
return $trial[. contains text { 'neoplasms' }]
Best regards,
Fabrice Etanchaud
From: basex-talk-bounces@mailman.uni-konstanz.de
[mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Ron Katriel
Sent: Monday, 11 September 2017 00:42
To: BaseX
Subject: [basex-talk] Issue with Full Text Retrieval
Hi,
I am seeing strange behavior with Full Text retrieval. The following query
fails for a number of words that are in the XML document (see attached):
for $trial in db:open('CTGovDebug') (:
[clinical_study/id_info/nct_id='NCT00473512'] :)
return $trial contains text { 'neoplasms' }
It fails on a good number of words including neoplasms, cougar, industry,
yes, completed, november, 2005, interventional, single, male, female,
assignment, none, research, principal, primary, secondary, age, years,
gender, etc. But it matches most of the words in the file.
Observation: The words that fail are located at the beginning and/or end of
the text and do not occur anywhere else in the middle of any text.
The document is the only one in the database. It does not make a difference
whether full text indexing is on or off. My BaseX version is 8.6.4.
Thanks,
Ron