Hi,
In text mining, the 'idf' or inverse document frequency is defined as idf(term) = ln(ndocuments / ndocuments containing term). I am working on a function that should return this idf.
This function:
  declare function local:wordFreq_idf($nodes as node()*) as array(*) {
    let $count := count($nodes)
    let $text := for $node in $nodes
                 return $node/text() => tokenize() => distinct-values()
    let $idf := $text => tidyTM:wordCount_arr()
    return $idf
  };
returns:
["probleem", 703] ["opgelost.", 248] ["dictu", 235] ["opgelost", 217] ["medewerker", 193] ...
For "probleem", the idf should be calculated as ln($count/703). Since there are 1780 nodes, this would result in 0.929011751. I tried to extend the 'let $idf' line with:
  => array:for-each(function($idf) {array:append($idf, math:log($count div $idf[2]) )})
which should result in ["probleem", 703, 0.929011751],
but no matter what I do, every time I get this error:
  [XPTY0004] Cannot promote (array(xs:anyAtomicType))+ to array(*): ([ "probleem", 703 ], [ "opgelost.", 248 ], ...).
Is it possible to apply array:for-each on an array of arrays?
Ben
Hi Ben - I'm on mobile, please excuse any typos.
Maybe `return array { $idf }` is closer?
Untested, apologies! Best, Bridger
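A minimal sketch of what the `array { ... }` constructor suggested above does, assuming XQuery 3.1 as implemented in BaseX. The two pairs are hand-made stand-ins for the output of tidyTM:wordCount_arr(), not real data from the thread. It shows the difference between a sequence of arrays and a single array whose members are arrays:

```xquery
(: Hand-made stand-in for the word counts: a SEQUENCE of two arrays. :)
let $pairs := (["probleem", 703], ["opgelost.", 248])
(: The curly constructor builds ONE array; each item of the sequence becomes a member. :)
let $wrapped := array { $pairs }
return (
  count($pairs),        (: 2, two items in the sequence :)
  count($wrapped),      (: 1, a single array item :)
  array:size($wrapped)  (: 2, two members, each itself an array :)
)
```

Only the wrapped form matches a declared return type of array(*); the unwrapped sequence would need array(*)*.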
On Mon, Mar 30, 2020 at 11:16:23PM +0200, Ben Engbers scripsit: [snip]
> For "probleem", the idf should be calculated as ln($count/703). Since there are 1780 nodes, this would result in 0.929011751. I tried to extend the 'let $idf' line with: => array:for-each(function($idf) {array:append($idf, math:log($count div $idf[2]) )}) which should result in ["probleem", 703, 0.929011751]
> but no matter what I do, every time I get this error: [XPTY0004] Cannot promote (array(xs:anyAtomicType))+ to array(*): ([ "probleem", 703 ], [ "opgelost.", 248 ], ...).
The error says you're trying to feed a sequence of arrays to an array function; maybe you want ! where you have => ?
-- Graydon
Hi,
Following your remark about feeding a sequence of arrays, I first tried to apply 'for-each' instead of 'array:for-each'. Alas, that didn't help ;-(, the error was still the same. I then tried to understand what you mean by the '!'. In the book by Priscilla Walmsley, the ! is mentioned as the simple map operator. How is that related to this problem?
Cheers, Ben
On Tue, Mar 31, 2020 at 04:21:52PM +0200, Ben Engbers scripsit:
> Upon your remark about feeding a sequence of arrays, I first tried to apply 'for-each' instead of 'array:for-each'. Alas, that didn't help ;-(, the error was still the same.
array:for-each takes a single array and gives you back a new array based on what the anonymous function passed as the second parameter does to each member of the original array.
So you have to make sure you're feeding a single array to it. (and you're not; that's what the error message is telling you, you've got a sequence of arrays on the left of the => operator.)
> I then tried to understand what you mean with the '!'. In the book from Priscilla Walmsley, the ! is mentioned as a simple map operator. How is that related to this problem?
=> means "take the thing on the left and substitute it for the first parameter of the function on the right", so
('weasels') => replace('weasels','mustelids') works
('weasels','badgers') => replace('weasels','mustelids') DOES NOT work
This is because a one-item sequence can be treated as the single string value the first parameter of replace() requires, but a greater-than-one-item sequence can't be. (This one gives you "item expected, sequence found" if you try it from the GUI.)
! means "take each item of the sequence on the left and pass it to the thing on the right in turn", so
('weasels','badgers') ! replace(.,'weasels','mustelids') works.
(note that replace() got its first parameter back as the context item dot.)
so if you take
=> array:for-each(function($idf) {array:append($idf,math:log($count div $idf[2]) )})
and replace it with ! array:for-each(.,function($idf) {array:append($idf,math:log($count div $idf[2]) )})
(note the context-item dot!)
you should at least get a different error message.
-- Graydon
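The difference between => and ! described above can be sketched as follows, assuming XQuery 3.1 in BaseX. The values of $count and $pairs are stand-ins taken from the numbers quoted in the thread. The last expression is one possible variant, not the exact line suggested above: it applies array:append directly to each inner array and uses the lookup .(2), because a predicate such as $idf[2] filters a sequence by position and does not fetch the second member of an array.

```xquery
let $count := 1780
let $pairs := (["probleem", 703], ["opgelost.", 248])
return (
  (: => passes the whole left-hand value as the first argument of replace(). :)
  'weasels' => replace('weasels', 'mustelids'),
  (: ! evaluates the right-hand side once per item, with . bound to that item. :)
  ('weasels', 'badgers') ! replace(., 'weasels', 'mustelids'),
  (: Here each . is one [word, n] array; .(2) looks up its second member. :)
  $pairs ! array:append(., math:log($count div .(2)))
)
```

The third expression yields one three-member array per input pair, for example the word, its count 703, and ln(1780 div 703).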
Hi,
> => means "take the thing on the left and substitute it for the first parameter of the function on the right", so
I thought it meant "The first parameter on the right will be substituted with the thing on the left"?
> ('weasels') => replace('weasels','mustelids') works
> ('weasels','badgers') => replace('weasels','mustelids') DOES NOT work
> This is because a one-item sequence can be treated as the single string value the first parameter of replace() requires, but a greater-than-one-item sequence can't be. (This one gives you "item expected, sequence found" if you try it from the GUI.)
The following is quite similar to the 'piping' mechanism in R. I'll start experimenting with it.
Thanx, Ben
> ! means "take each item of the sequence on the left and pass it to the thing on the right in turn", so
> ('weasels','badgers') ! replace(.,'weasels','mustelids') works.
> (note that replace() got its first parameter back as the context item dot.)
> so if you take
> => array:for-each(function($idf) {array:append($idf,math:log($count div $idf[2]) )})
> and replace it with ! array:for-each(.,function($idf) {array:append($idf,math:log($count div $idf[2]) )})
> (note the context-item dot!)
> you should at least get a different error message.
> -- Graydon
On 30.03.2020 at 23:16, Ben Engbers wrote:
> This function:
> declare function local:wordFreq_idf($nodes as node()*) as array(*) { let $count := count($nodes) let $text := for $node in $nodes return $node/text() => tokenize() => distinct-values() let $idf := $text => tidyTM:wordCount_arr() return $idf };
> returns:
> ["probleem", 703] ["opgelost.", 248] ["dictu", 235] ["opgelost", 217] ["medewerker", 193] ...
So does the working function return a sequence of arrays? That doesn't match the as array(*) return type declaration, it seems.
What does tidyTM:wordCount_arr() return, a single array (of atomic items)?
Hi,
For (my personal) clarity, I have split up the original function into two parts:

  declare function local:step_one($nodes as node()*) as array(*)* {
    let $text := for $node in $nodes
                 return $node/text() => tokenize() => distinct-values()
    let $idf := $text => tidyTM:wordCount_arr()
    return $idf
  };

In local:step_one(), I first create a sequence with the distinct tokens for each $node. All the sequences are joined in $text. I then call wordCount_arr to count the occurrences of each word in $text:

  declare function tidyTM:wordCount_arr($Words as xs:string*) as array(*) {
    for $w in $Words
    let $f := $w
    group by $f
    order by count($w) descending
    return ([$f, count($w)])
  };
I would say that tidyTM:wordCount_arr returns a sequence of arrays but I am not certain if I have specified the correct return-type?
Calling local:step_one(tidyTM:remove_Stopwords($nodes, "Stp", $Stoppers)) returns: ["probleem", 703] ["opgelost.", 248] ....
I had hoped that calling the following local:wordFreq_idf would add the idf to each element, but instead I get an error:

  declare function local:wordFreq_idf($nodes as node()*) as array(*) {
    let $count := count($nodes)
    let $idf := local:step_one($nodes)
    let $result := for-each($idf, function($z) { array:append($z, math:log($count div $z(2))) })
    return $result
  };

  [XPTY0004] Cannot promote (array(xs:anyAtomicType))+ to array(*): $idf := ([ "probleem", 703 ], [ "opgelost.", 248 ], ...).
Cheers, Ben
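A possible way to make the two steps above type-check, offered as a sketch rather than a confirmed fix. It assumes the XPTY0004 error comes from declaring `as array(*)` on functions that actually return a sequence of arrays. The names local:word-count and local:add-idf are hypothetical stand-ins for tidyTM:wordCount_arr and local:wordFreq_idf, and the sample call uses made-up input.

```xquery
(: Hypothetical stand-in for tidyTM:wordCount_arr. The return type is array(*)*,
   zero or more arrays, because the FLWOR returns one array per distinct word. :)
declare function local:word-count($words as xs:string*) as array(*)* {
  for $w in $words
  let $f := $w
  group by $f
  order by count($w) descending
  return [$f, count($w)]
};

(: $count is the number of documents; each result is [word, n, ln($count div n)]. :)
declare function local:add-idf($pairs as array(*)*, $count as xs:integer) as array(*)* {
  for-each($pairs, function($p) { array:append($p, math:log($count div $p(2))) })
};

(: Made-up input: three tokens, two "documents". :)
local:add-idf(local:word-count(('a', 'b', 'a')), 2)
```

If a single array of arrays is wanted instead, the result could be wrapped with `array { ... }` and the return type kept as array(*).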
basex-talk@mailman.uni-konstanz.de