On 02-03-2020 at 13:27, Christian Grün wrote:
Hi Ben,
Here is an alternative version that, as I believe, should match your requirements better:
  let $words := distinct-values(
    for $text in db:open('Incidents')/csv/record/INC_RM
    return ft:tokenize($text)
  )
  let $stopwords := db:open('Stopwords')/text/line
  let $result := $words[not(. = $stopwords)]
  return sort($result)
There is no need to remove nbsp substrings as they’ll never occur in your input, and the ft:tokenize function will ensure that your input (case, special characters, diacritics) will be normalized (see [1,2] for more details). Using functx is perfectly valid; I only removed the reference to make the code a bit shorter.
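To illustrate the normalization step (a small sketch; the exact output depends on the default full-text options of your BaseX instance), ft:tokenize lowercases its input, strips punctuation, and, with the default diacritics setting, folds accented characters:

```xquery
(: Tokenization normalizes case and special characters.
   With default options, accented letters are folded as well. :)
ft:tokenize('Café, CAFÉ and café!')
(: yields one lowercase token per word, e.g. 'cafe' three times and 'and' :)
```

This is why the distinct-values/stopword comparison above works without any manual lowercasing or character cleanup.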
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text_Module#ft:tokenize [2] http://docs.basex.org/wiki/Full-Text
Hi Christian,
Since my primary goal at this moment is to see how BaseX/XQuery can be used for full-text analysis (and to compare the results, or the effort needed, with similar tasks in R), I am very glad that you brought the ft:tokenize() function to my attention!
Ben
PS: Just for fun, I created a repository with this tiny function:

  declare function tidyTM:wordFreqs(
    $Words as xs:string*
  ) {
    for $w in $Words
    let $f := $w
    group by $f
    order by count($w) descending
    return ($f, count($w))
  };
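A quick usage sketch (assuming the function is available under the tidyTM prefix as above); the function returns an alternating sequence of word and count, most frequent words first:

```xquery
(: Count word frequencies in a small token sequence.
   'to' and 'be' occur twice each, 'or' and 'not' once each,
   so the two-count pairs come before the one-count pairs. :)
tidyTM:wordFreqs(('to', 'be', 'or', 'not', 'to', 'be'))
```

Note that ties (words with the same count) may come back in any order, since the order by clause only sorts on the count.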
It took less than 10 minutes to create a repository and populate it with this function. Creating an R package takes much longer!