My question is: Do the scores returned represent tf/idf?

It depends on your full text index; which scoring mode have you chosen? Next, please have a look at the Query Info (Query -> Query Info) to get some more insight into the internals.

Best,

Christian

Kind regards,

Wiard

2011/4/3 Wiard Vasen <wiard.vasen@gmail.com>
Thank you very much Andreas,

You, Christian and Leonard have helped me a lot!

Have a nice evening!

Regards,

Wiard

2011/4/3 Andreas Weiler <andreas.weiler@uni-konstanz.de>

My idea was that you first get the positions of the first and last documents in your range (first query).
Note them and put them into the second query as x and y.

-- Andreas
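
The two steps described above could also be combined into one query. This is only a sketch: it assumes that the documents are returned in the order of their file names, and that let001.xml and let201.xml (the names from the question) each occur exactly once. The variables $docs, $x and $y are introduced here for illustration.

```xquery
(: Sketch: look up the positions of the two boundary documents,
   then keep only the documents whose position lies in that range.
   Assumes the database order matches the file-name order. :)
let $docs := db:open("tfidfbrievenvangogh")
let $x := (for $n at $pos in $docs
           where ends-with(base-uri($n), "let001.xml")
           return $pos)[1]
let $y := (for $n at $pos in $docs
           where ends-with(base-uri($n), "let201.xml")
           return $pos)[1]
for $n at $pos in $docs
(: >= and <= include the boundary documents themselves :)
where $pos >= $x and $pos <= $y
return
  for $i in $n//*
  where $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>
```

Both sides of each comparison are now integers, so the xs:integer/xs:string type error should not occur.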

On 03.04.2011 at 18:44, Wiard Vasen wrote:

Hi Andreas,

Maybe I don't understand the query you suggested. I worked it out this way:

for $n at $pos in db:open("tfidfbrievenvangogh")

where $pos > ( "let001.xml") and $pos < ( "let201.xml")
return
for $i in $n//*
where $i[text() contains text 'above']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

I do understand the error, though: xs:integer and xs:string can't be compared.

How do I improve this query, so that it works?

Regards,

Wiard

2011/4/3 Andreas Weiler <andreas.weiler@uni-konstanz.de>

Hi Wiard,

you can get the positions of the first and last documents you want with the following query (add a where clause to select the right documents, such as where ends-with(base-uri($n), "test.doc")):

for $n at $pos in db:open("tfidfbrievenvangogh")
return <hit><doc>{base-uri($n)}</doc><pos>{$pos}</pos></hit>

Then substitute these positions for x and y in the query below.

for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos > x and $pos < y
return
for $i in $n//*
where $i[text() contains text 'above']

return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

I hope this works.

-- Andreas

On 03.04.2011 at 18:14, Wiard Vasen wrote:

Hi Andreas,

Thanks a lot! It works fine.

I was wondering whether, instead of entering the following query in BaseX:

for $n in ("let680.xml", "let681.xml", "let682.xml", "let683.xml", "let684.xml", "let685.xml", "let686.xml", "let687.xml", "let688.xml", "let689.xml")

return
for $i at $pos in db:open("tfidfbrievenvangogh")//*

where ends-with(base-uri($i), $n)
and $i[text() contains text 'above']

return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

it is also possible to do something like:

for ("let680.xml" )<= $n <= ("let689")

return

for $i at $pos in db:open("tfidfbrievenvangogh")//*
where ends-with(base-uri($i), $n)

and $i[text() contains text 'above']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

That way I hope to define the outer documents of a subset and get all the documents in between, with the outer documents included.

Do you think this is possible in a query like shown above?

Regards,

Wiard
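
One way the range idea above might be expressed is to generate the file names from a numeric range instead of listing them. This is a sketch only: it assumes the letter numbers in the range all have the same number of digits (as with let680.xml to let689.xml), so no zero padding is needed. The variable $k is introduced here for illustration.

```xquery
(: Sketch: build "let680.xml" ... "let689.xml" from the integers 680 to 689.
   The range operator "to" includes both end points. :)
for $k in 680 to 689
let $n := concat("let", $k, ".xml")
return
  for $i in db:open("tfidfbrievenvangogh")//*
  where ends-with(base-uri($i), $n)
    and $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>
```

Changing the two numbers selects a different subset, for example 567 to 689.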

2011/4/3 Andreas Weiler <andreas.weiler@uni-konstanz.de>

Hi Wiard,

I hope I understand your plans; here is what I would do:

for $n in ("let567.xml", "let689.xml")
return
for $i at $pos in db:open("tfidfbrievenvangogh")//*
where ends-with(base-uri($i), $n)

and $i[text() contains text 'kleur']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'kleur'])}</score></hit>

Now you can extend the sequence bound to $n with all the filenames you would like to include.

I hope this helps,
Andreas

On 03.04.2011 at 14:24, Wiard Vasen wrote:

Hi Andreas,

Wow! This is the complete answer to my question!

I hope you can help me with the next question.
Because I am analyzing changes in the artistic life of Van Gogh, I am partitioning the relatively large repository of annotated XML files on the basis of his place of residence.

For that reason I need to put a query like:

for $i at $pos in db:open("tfidfbrievenvangogh")//*
where $i[text() contains text 'kleur']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'kleur'])}</score></hit>

with the extension: given the interval, all XML files between let567.xml and let689.xml.

That is, I know that Van Gogh was in Arles when he wrote the letters in this partition of XML files.
And I want to know the tf-idf score of the Dutch word 'kleur'.

To summarize my question: How do I partition the repository into subsets, so that I can produce information on these subsets?

And how do I do this in BaseX with XQuery?

Thanks a lot beforehand!

Kind regards,

Wiard

2011/4/3 Andreas Weiler <andreas.weiler@uni-konstanz.de>

Hi Wiard,

you could use the base-uri function of XQuery, like this (it can probably be done more simply):

for $i at $pos in db:open("DB")//*

where $i[text() contains text 'xml']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'xml'])}</score></hit>

-- Andreas
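
As a possible variation on the query above, the score can be bound with a score clause (the same syntax as in the live-demo query quoted further down the thread) and used to sort the hits. This is a sketch that assumes a full-text index exists for "DB". The variable $s is introduced here for illustration.

```xquery
(: Sketch: bind each hit's score to $s, list the best matches first,
   and report the document each hit comes from. :)
for $i score $s in db:open("DB")//*[text() contains text 'xml']
order by $s descending
return <hit>{base-uri($i)}<score>{$s}</score></hit>
```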

On 03.04.2011 at 12:42, Wiard Vasen wrote:

Dear Christian and Andreas,

Thanks for your great help!
I used Christian's solution: ft:score(db:open("DB")//*[text() contains text 'xml'])
And it works fine.

The next step is that I want to get the associated documents with these scores.

Could you help me with this step?

What I get now is a list of frequencies, without references to the particular documents.

I think what is needed is the tf/idf score.

Regards,

Wiard

2011/3/31 Christian Grün <christian.gruen@gmail.com>

Hi Wiard,

the tf/idf scoring is only available if you are working with full-text
index structures. If you have built a full-text index for your
database "DB", the following query will yield different scoring
results, depending on the chosen scoring model:

ft:score(db:open("DB")//*[text() contains text 'xml'])

As Andreas indicated, however, you may either set the SCORING property
with a db command or explicitly choose the type of scoring in the
GUI's database creation dialog (Database -> New -> Fulltext -> TF/IDF
Scoring).

Christian

On Thu, Mar 31, 2011 at 8:33 AM, Andreas Weiler
<andreas.weiler@uni-konstanz.de> wrote:
> Dear Wiard Vasen,
> you just need to set the scoring property once.
> If you work in the GUI:
> Go to the top input bar, choose command and type:
> set scoring *
>
> Replace * with the scoring algorithm you would like to use.
> In the console just type: set scoring *
> After setting this you can use the score function, like in the 8th query of
> our online demo (basex.org/products/live-demo):
> let $names := ('Jack London', 'Jack', 'Jim Beam', 'Jack Daniels')
> for $name at $pos score $score in $names[. contains text 'Jack']
> order by $score descending
> return <name pos='{ $pos }'>{ $name }</name>
> Don't hesitate to ask for more,
> Andreas
> On 30.03.2011 at 22:02, Wiard Vasen wrote:
>
> Dear sirs of Basex,
> I am doing my Master thesis on the letters of Vincent van Gogh at the
> University of Amsterdam.
> For that purpose I use BaseX to analyze the letters.
> I wonder, is there the possibility to generate a tf/idf score automatically?
> In your faq I noticed there needs to be a special term like 'SET SCORING 0'
> to be able to get a tf/idf score.
> This information I get from the following
> page: http://docs.basex.org/wiki/Full-Text
> Could you help me with this?
> I would be very grateful.
> Kind regards,
> _______________________________________________
> BaseX-Talk mailing list
> BaseX-Talk@mailman.uni-konstanz.de
> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk