Re: [basex-talk] tf/idf

5 Apr 2011


Dear Christian,
When I created the database, I ticked the TF/IDF checkbox in the
'Full Text' properties.
So I think that the score in the query returns the TF/IDF score.
Do you think I am right?
Thanks in advance for your answer.
Kind regards,
Wiard
2011/4/5 Christian Grün christian.gruen@gmail.com
...
Dear Wiard,
My question is: Do the scores returned represent tf/idf?
...
It depends on your full text index; which scoring mode have you
chosen? Next, please have a look at the Query Info (Query -> Query Info) to
get some more insight into the internals.
Best,
Christian
...
Kind regards,
Wiard
2011/4/3 Wiard Vasen wiard.vasen@gmail.com
...
Thank you very much Andreas,
You, Christian and Leonard have helped me a lot!
Have a nice evening!
Regards,
Wiard
2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de
...
What I meant was that you first get the positions of the first and last
documents in your range (first query).
Note them and put them into the second query for x and y.
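A possible alternative, only as a sketch: compare the file names themselves instead of the positions, which avoids comparing an xs:integer with an xs:string. This assumes that every file is named letNNN.xml with a zero-padded number, so that string order matches the order of the letters. The bounds are inclusive here, and $name is a helper variable introduced for illustration.

```xquery
(: Sketch only: select documents by file name instead of by position.
   Assumes zero-padded names such as let001.xml ... let201.xml. :)
for $n in db:open("tfidfbrievenvangogh")
(: last path segment of the document URI, e.g. "let001.xml" :)
let $name := tokenize(base-uri($n), "/")[last()]
where $name >= "let001.xml" and $name <= "let201.xml"
return
  for $i in $n//*[text() contains text 'above']
  return
    <hit>{ $name }<score>{ ft:score($i[text() contains text 'above']) }</score></hit>
```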
-- Andreas
On 03.04.2011 at 18:44, Wiard Vasen wrote:
Hi Andreas,
Maybe I don't understand the query you suggested. I worked it out this
way:
for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos > ( "let001.xml") and $pos < ( "let201.xml")
return
for $i in $n//*
where $i[text() contains text 'above']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text
'above'])}</score></hit>
I do understand the error, though: xs:integer and xs:string can't be
compared.
How can I fix this query so that it works?
Regards,
Wiard
2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de
...
Hi Wiard,
you can get the positions of the first and last documents you want with the
following query (add a where clause to get the right documents, like where
ends-with(base-uri($n), "test.doc")):
for $n at $pos in db:open("tfidfbrievenvangogh")
return <hit><doc>{base-uri($n)}</doc><pos>{$pos}</pos></hit>
Then set these positions for x and y in the query below.
for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos > x and $pos < y
return
for $i in $n//*
where $i[text() contains text 'above']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text
'above'])}</score></hit>
I hope this works.
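As a sketch of the second step with the bounds filled in: the values 680 and 689 are made-up placeholders for x and y, not positions taken from your database, and the comparison is written with >= and <= so that the outer documents are included.

```xquery
(: Sketch only: 680 and 689 are placeholder positions standing in for x and y;
   replace them with the <pos> values returned by the first query. :)
let $x := 680
let $y := 689
for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos >= $x and $pos <= $y
return
  for $i in $n//*[text() contains text 'above']
  return
    <hit>{ base-uri($i) }<score>{ ft:score($i[text() contains text 'above']) }</score></hit>
```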
-- Andreas
On 03.04.2011 at 18:14, Wiard Vasen wrote:
Hi Andreas,
Thanks a lot! It works fine.
I was wondering whether, instead of entering the following query in BaseX:
for $n in ("let680.xml", "let681.xml", "let682.xml", "let683.xml",
"let684.xml", "let685.xml", "let686.xml", "let687.xml", "let688.xml",
"let689.xml")
return
for $i at $pos in db:open("tfidfbrievenvangogh")//*
where ends-with(base-uri($i), $n)
and $i[text() contains text 'above']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text
'above'])}</score></hit>
it is also possible to do something like:
for  ("let680.xml" )<= $n <= ("let689")
return
for $i at $pos in db:open("tfidfbrievenvangogh")//*
where ends-with(base-uri($i), $n)
and $i[text() contains text 'above']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text
'above'])}</score></hit>
That way I hope to define the outer documents of a subset and get all
the documents in between, with the outer documents included.
Do you think this is possible in a query like shown above?
Regards,
Wiard
2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de
...
Hi Wiard,
I hope I understand your plans; here is what I would do:
for $n in ("let567.xml", "let689.xml")
return
for $i at $pos in db:open("tfidfbrievenvangogh")//*
where ends-with(base-uri($i), $n)
and $i[text() contains text 'kleur']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text
'kleur'])}</score></hit>
Now you can extend the variable $n with all the file names you would like
to include.
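If the files really are numbered consecutively, the list does not have to be typed by hand. A minimal sketch, assuming the names follow the pattern "let" + number + ".xml" throughout the range 567 to 689 (the variable $names is introduced here only for illustration):

```xquery
(: Sketch only: builds "let567.xml", "let568.xml", ... "let689.xml".
   Assumes three-digit numbers, so no zero padding is needed in this range. :)
let $names := (for $k in 567 to 689 return concat("let", $k, ".xml"))
for $n in $names
return
  for $i in db:open("tfidfbrievenvangogh")//*
  where ends-with(base-uri($i), $n)
    and $i[text() contains text 'kleur']
  return
    <hit>{ base-uri($i) }<score>{ ft:score($i[text() contains text 'kleur']) }</score></hit>
```

Numbers that have no matching file simply produce no hits.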
I hope this helps,
Andreas
On 03.04.2011 at 14:24, Wiard Vasen wrote:
Hi Andreas,
Wow! This is the complete answer to my question!
I hope you can help me with the next question.
Because I am analyzing changes in the artistic life of Van Gogh, I am
partitioning the relatively large repository of annotated XML files on the
basis of residence.
For that reason I need to write a query like:
for $i at $pos in db:open("tfidfbrievenvangogh")//*
where $i[text() contains text 'kleur']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text
'kleur'])}</score></hit>
with the extension: given the interval, all XML files
between let567.xml and let689.xml.
This means that I know that, for the XML files in this partition, Van Gogh
was in Arles.
And I want to know the tf-idf score of the Dutch word 'kleur'.
To summarize my question: How do I partition the repository into
subsets, so that I can produce information on these subsets?
And how do I do this in BaseX with XQuery?
Thanks a lot in advance!
Kind regards,
Wiard
2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de
> Hi Wiard,
>
> you could use the base-uri function of XQuery, like this (there is
> probably an easier way):
>
> for $i at $pos in db:open("DB")//*
> where $i[text() contains text 'xml']
> return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text
> 'xml'])}</score></hit>
>
> -- Andreas
>
> On 03.04.2011 at 12:42, Wiard Vasen wrote:
>
> Dear Christian and Andreas,
>
> Thanks for your great help!
> I used Christian's solution: ft:score(db:open("DB")//*[text()
> contains text 'xml'])
> and it works fine.
>
> The next step is that I want to get the documents associated with
> these scores.
>
> Could you help me with this step?
>
> What I get now is a list of frequencies, without
> references to the particular documents.
> I think what is needed is the tf/idf score.
>
> Regards,
>
> Wiard
>
>
> 2011/3/31 Christian Grün christian.gruen@gmail.com
>
>> Hi Wiard,
>>
>> the tf/idf scoring is only available if you are working with full-text
>> index structures. If you have built a full-text index for your
>> database "DB", the following query will yield different scoring
>> results, depending on the chosen scoring model:
>>
>>  ft:score(db:open("DB")//*[text() contains text 'xml'])
>>
>> As Andreas indicated, however, you may either set the SCORING property
>> with a db command or explicitly choose the type of scoring in the
>> GUI's database creation dialog (Database -> New -> Fulltext -> TF/IDF
>> Scoring).
>>
>> Christian
>>
>>
>>
>> On Thu, Mar 31, 2011 at 8:33 AM, Andreas Weiler
>> andreas.weiler@uni-konstanz.de wrote:
>> > Dear Wiard Vasen,
>> > you just need to set the scoring property once.
>> > If you work in the GUI:
>> > Go to the top input bar, choose command and type:
>> > set scoring *
>> >
>> > Replace * with the scoring algorithm you like.
>> > In the console just type: set scoring *
>> > After setting this you can use the score function, as in the 8th query of
>> > our online demo (basex.org/products/live-demo):
>> > let $names := ('Jack London', 'Jack', 'Jim Beam', 'Jack Daniels')
>> > for $name at $pos score $score in $names[. contains text 'Jack']
>> > order by $score descending
>> > return <name pos='{ $pos }'>{ $name }</name>
>> > Don't hesitate to ask for more,
>> > Andreas
>> > On 30.03.2011 at 22:02, Wiard Vasen wrote:
>> >
>> > Dear sirs of BaseX,
>> > I am doing my Master's thesis on the letters of Vincent van Gogh at the
>> > University of Amsterdam.
>> > For that purpose I use BaseX to analyze the letters.
>> > I wonder, is it possible to generate a tf/idf score automatically?
>> > In your FAQ I noticed that a special command like 'SET SCORING 0'
>> > is needed to be able to get a tf/idf score.
>> > I got this information from the following
>> > page: http://docs.basex.org/wiki/Full-Text
>> > Could you help me with this?
>> > I would be very grateful.
>> > Kind regards,
>> > _______________________________________________
>> > BaseX-Talk mailing list
>> > BaseX-Talk@mailman.uni-konstanz.de
>> > https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk