Grouping words in phrasal matches with full-text indexes - BaseX-Talk - mailman.uni-konstanz.de

25 Apr 2020


Hello --
So my overall goal is to take a bunch of XML, mark all the (generally
phrasal) terms of art, take that modified content and mark all the
(possibly phrasal) glossary terms, and then go through and remove all the
glossary markers that happen to be inside terms of art and then remove all
the term-of-art markers.  (There's an intermediate step between "found all
the possible glossary terms" and "have applied the glossary terms" where
the list of candidate terms gets sent off for semantic approval, so the
"find a term" steps and "change the documents in which the terms are found"
steps have to be distinct.)
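As a minimal sketch of that final cleanup step: the element names here are hypothetical (<toa> for term-of-art markers, <gloss> for glossary markers; neither comes from the real content), and so are the helper functions local:unwrap and local:drop-nested-gloss. Something like this might do:

```xquery
(: Hypothetical marker names: <toa> = term of art, <gloss> = glossary term. :)

(: Copy a node, replacing every element whose local name is in $names
   with its (recursively processed) children. :)
declare function local:unwrap($n as node(), $names as xs:string*) as node()* {
  typeswitch ($n)
    case element() return
      if (local-name($n) = $names)
      then $n/node() ! local:unwrap(., $names)
      else element { node-name($n) } {
        $n/@*,
        $n/node() ! local:unwrap(., $names)
      }
    default return $n
};

(: Copy a node, removing <gloss> markers only where they sit inside a <toa>. :)
declare function local:drop-nested-gloss($n as node()) as node()* {
  typeswitch ($n)
    case element(toa) return
      element toa { $n/@*, $n/node() ! local:unwrap(., 'gloss') }
    case element() return
      element { node-name($n) } {
        $n/@*,
        $n/node() ! local:drop-nested-gloss(.)
      }
    default return $n
};

let $marked :=
  <para>A <toa>diverse <gloss>glossary</gloss> term</toa> and a <gloss>glossary</gloss> entry.</para>
(: first drop glossary markers inside terms of art, then drop the term-of-art markers :)
return local:drop-nested-gloss($marked) ! local:unwrap(., 'toa')
```

On the sample paragraph this should yield <para>A diverse glossary term and a <gloss>glossary</gloss> entry.</para>: the glossary marker inside the term of art is gone, the one outside survives, and the term-of-art marker itself is removed last.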
My initial problem was marking phrasal terms; the full-text index is very
fast and solves the "this rapidly becomes a nightmare with regular
expressions, especially regular expressions with no 'whole words only'
switch" problem, but it marks every word in the phrasal term individually.
I think I have figured out a way to connect the adjacent marked words in
the phrasal term into a single mark element. I cannot convince myself that
this is the right way; is there a better approach than tumbling windows?
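To make the per-word marking concrete, here is a minimal sketch. It assumes the test database 'DB' has already been created with the commented-out db:create call at the top of the query that follows:

```xquery
(: assumes the database 'DB' from the db:create call shown with the main query :)
let $term as xs:string := 'Diverse and various'
return ft:mark(
  db:open('DB')//para[text() contains text { $term } phrase using case sensitive],
  'mark'
)
(: going by the behaviour described above, each matched word is wrapped
   separately, roughly:
   <para id="GUID-12354"><mark>Diverse</mark> <mark>and</mark> <mark>various</mark> words, ...</para> :)
```

The whitespace (or hyphen) between the individual mark elements stays behind as plain text nodes, which is what the grouping step has to bridge.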
(: db:create("DB", <para id="GUID-12354" >Diverse and various words, some
of which are going to be tagged for review as glossary terms.</para>,
'test.xml', map { 'ftindex': true() }) :)
(: example phrasal term :)
let $term as xs:string := 'Diverse and various'
for $ft in (db:open('DB')//*[text() contains text { $term } phrase using case sensitive])
  return
    <changed>{
      let $contents as node()+ :=
        ft:mark($ft[text() contains text { $term } phrase using case sensitive], 'mark')
      return element {name($contents)} {
        $contents/@*,
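        (: has to handle hyphens as well as spaces :)
        (: a window keeps growing across marks and the whitespace-only or
           lone-hyphen text nodes between them; it closes just before a
           non-mark node with other content, or before a mark that follows
           a non-separator node :)
        for tumbling window $w in $contents/node()
        start $s when true()
        end $e previous $eprev next $enext
          when ($enext[not(self::mark)]
                and $enext[normalize-space()][not(matches(., '^-$'))])
            or ($enext[self::mark]
                and $e[normalize-space()][not(matches(., '^-$'))])
        (: a window containing any mark is collapsed into a single mark :)
        return
          if ($w[self::mark]) then <mark>{string-join($w, '')}</mark>
          else $w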
      }
  }</changed>
thanks!
Graydon