As far as I can tell, the only difference is that the first looks specifically in the condition field (<condition>BRCA Mutated</condition>) while the latter queries the entire XML (the individual text nodes), as Fabrice suggested.
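
A minimal sketch of the two query shapes being compared, assuming the database is still named 'CTGovDebug' as earlier in this thread and using 'brca' purely as a placeholder search term (neither query is copied from the actual session):

```xquery
(: Sketch only: the database name is taken from earlier in the thread;
   the search term 'brca' is an assumed placeholder. :)

(: 1) restrict the search to text nodes inside <condition> elements :)
db:open('CTGovDebug')//condition/text()[. contains text { 'brca' }],

(: 2) search every text node in the document :)
db:open('CTGovDebug')//text()[. contains text { 'brca' }]
```

Both expressions address text nodes directly, so neither depends on how the string value of a whole element or document is assembled.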
See the attached XML that is being matched (CT.gov trial NCT01853306).

On September 19, 2017 at 5:04:28 AM, Christian Grün (christian.gruen@gmail.com) wrote:
> Yes, this helps. By index rewritings, are you referring to the indices
> created when FTINDEX is set to true?
Ditto.
>
> Thanks,
> Ron
>
> On September 18, 2017 at 11:12:54 AM, Christian Grün
> (christian.gruen@gmail.com) wrote:
>
> Hi Ron,
>
> With mixed content, it can be beneficial if element boundaries are
> ignored. An example:
>
> <xml><b>H</b>ello world!</xml>
> contains text 'hello'
>
> If you set the CHOP option to false before creating a database,
> whitespace will be included in your database. As Fabrice has pointed
> out, however, it is usually better to directly address the text nodes
> of your database; otherwise, you won’t be able to benefit from the
> index rewritings.
>
> Hope this helps,
> Christian
>
>
>
> On Mon, Sep 11, 2017 at 4:59 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
>> Thanks Fabrice and Michael. Solution (1) works great!
>>
>> A parting question: when querying the textual representation of a
>> document, why not make the default behavior keep the critical word
>> boundary delimiters instead of “chopping” them away? In the example below,
>> it would then return
>>
>> XQuery
>> and XPAth are awesome
>>
>> The munging together of "XPAth" and “are” seems counterintuitive to me.
>>
>> Best,
>> Ron
>>
>> On September 11, 2017 at 4:13:54 AM, Michael Seiferle (ms@basex.org)
>> wrote:
>>
>> Hi Ron,
>> Hi Fabrice,
>>
>> Your observation w.r.t. element boundaries is right: the document is
>> converted to a textual representation; by default, it returns all nodes in
>> their string representation:
>>
>> let $doc :=
>>
>> <doc>
>> XQuery
>> <_>and XPAth</_>
>> <_>are awesome</_>
>> </doc>/data()
>>
>> Will turn to:
>>
>>
>> XQuery
>> and XPAthare awesome
>>
>>
>> So:
>>
>> $doc contains text { 'XPath' }
>>
>>
>> will return false.
>>
>> You have 3.5 options:
>>
>> 1) => as Fabrice showed, query the individual text nodes
>>
>> 2) use the ft:search() Function to query the index directly,
>>
>> http://docs.basex.org/wiki/Full-Text_Module#ft:search
>>
>> ft:search(
>> 'CTGovDebug',
>> 'neoplasms'
>> )/.. (: get parent element for the matching text()-node :)
>>
>>
>> 3) disable chopping when creating the database,
>>
>> http://docs.basex.org/wiki/Options#XML_Parsing
>>
>> db:create(
>> 'CTGovDebug',
>> "Path/to/NCT00473512.xml",
>> "NCT00473512.xml",
>>
>> map {
>> 'ftindex': true(),
>> 'chop': false()
>> })
>>
>>
>> 3.5) use the xml:space="preserve" attribute to tell the parser not to chop
>> child nodes of <clinical_study/> when creating a database:
>>
>> <clinical_study xml:space="preserve">
>> <!-- This xml conforms to an XML Schema at:
>>
>> https://clinicaltrials.gov/ct2/html/images/info/public.xsd
>> -->
>> <required_header>
>> <download_date>ClinicalTrials.gov processed this data on August 31,
>> 2017</download_date>
>> <link_text>Link to the current ClinicalTrials.gov record.</link_text>
>>
>>
>>
>> Hope this helped shed some light :-)
>>
>> Best from Konstanz
>> Michael
>> --
>> Michael Seiferle, BaseX GmbH,
>> http://www.basexgmbh.de
>> |-- Registered office: Obere Laube 73, 78462 Konstanz
>> |-- Register court Freiburg, HRB: 708285, Managing directors:
>> | Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
>> `-- Tel: +49 7531 916 82 77
>>
>> On 11.09.2017 at 09:35, Fabrice ETANCHAUD
>> <fetanchaud@pch.cerfrance.fr> wrote:
>>
>> Hello Ron,
>>
>> I don’t know how ft operators behave on document nodes.
>> Supposing documents are converted to their data() representation, your
>> query would yield the same negative answer.
>> You should consider applying ft operators on text nodes like this:
>>
>> for $trial in db:open('NCT00473512')//text() (:
>> [clinical_study/id_info/nct_id='NCT00473512'] :)
>> return $trial[. contains text { 'neoplasms' }]
>>
>> Best regards,
>> Fabrice Etanchaud
>>
>>
>> From: basex-talk-bounces@mailman.uni-konstanz.de
>> [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Ron
>> Katriel
>> Sent: Monday, September 11, 2017 00:42
>> To: BaseX
>> Subject: [basex-talk] Issue with Full Text Retrieval
>>
>> Hi,
>>
>> I am seeing strange behavior with Full Text retrieval. The following query
>> fails for a number of words that are in the XML document (see attached):
>>
>> for $trial in db:open('CTGovDebug') (:
>> [clinical_study/id_info/nct_id='NCT00473512'] :)
>> return $trial contains text { 'neoplasms' }
>>
>> It fails on a good number of words including neoplasms, cougar, industry,
>> yes, completed, november, 2005, interventional, single, male, female,
>> assignment, none, research, principal, primary, secondary, age, years,
>> gender, etc. But it matches most of the words in the file.
>>
>> Observation: The words that fail are located at the beginning and/or end
>> of the text and do not occur anywhere else in the middle of any text.
>>
>> The document is the only one in the database. It does not make a
>> difference whether full text indexing is on or off. My BaseX version is
>> 8.6.4.
>>
>> Thanks,
>> Ron
>>
>>
>> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
>> 350 Hudson Street, 7th Floor, New York, NY 10014
>> rkatriel@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
>> main: +1 212 918 1800
>>
>>