Hello --
So my overall goal is to take a bunch of XML, mark all the (generally phrasal) terms of art, take that modified content and mark all the (possibly phrasal) glossary terms, and then go through and remove all the glossary markers that happen to be inside terms of art and then remove all the term-of-art markers. (There's an intermediate step between "found all the possible glossary terms" and "have applied the glossary terms" where the list of candidate terms gets sent off for semantic approval, so the "find a term" steps and "change the documents in which the terms are found" steps have to be distinct.)
My initial problem was marking phrasal terms. The full-text index is very fast, and it solves the "this rapidly becomes a nightmare with regular expressions, especially regular expressions with no 'whole words only' switch" problem, but it marks every word in the phrasal term individually.
I think I have figured out a way to connect the adjacent marked words in the phrasal term into a single mark element. I cannot convince myself that this is the right way; is there a better approach than tumbling windows?
(: db:create("DB", <para id="GUID-12354">Diverse and various words, some of which are going to be tagged for review as glossary terms.</para>, 'test.xml', map { 'ftindex': true() }) :)

(: example phrasal term :)
let $term as xs:string := 'Diverse and various'
for $ft in (db:open('DB')//*[text() contains text { $term } phrase using case sensitive])
return
  <changed>{
    let $contents as node()+ :=
      ft:mark($ft[text() contains text { $term } phrase using case sensitive], 'mark')
    return element { name($contents) } {
      $contents/@*,
      (: has to handle hyphens as well as spaces :)
      for tumbling window $w in $contents/node()
        start $s when true()
        end $e previous $eprev next $enext when (
          $enext[not(self::mark)] and ($enext[normalize-space()][not(matches(., '^-$'))])
        ) or (
          $enext[self::mark] and $e[normalize-space()][not(matches(., '^-$'))]
        )
      return
        if ($w[self::mark])
        then <mark>{ string-join($w, '') }</mark>
        else $w
    }
  }</changed>
thanks! Graydon
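The final clean-up described in the message above (drop glossary markers that sit inside a term of art, then drop the term-of-art markers themselves) can be sketched in plain XQuery. This is only an illustration under assumptions: the element names toa and gloss are hypothetical stand-ins for whatever marker elements are actually used, and the two removals are folded into a single recursive pass rather than run as separate steps.

```xquery
xquery version "3.1";
declare boundary-space preserve;

(: Hypothetical marker names: <toa> wraps a term of art, <gloss> a glossary term. :)
declare function local:strip($n as node(), $in-toa as xs:boolean) as node()* {
  typeswitch ($n)
    (: unwrap every term-of-art marker and remember that we are inside one :)
    case element(toa) return
      $n/node() ! local:strip(., true())
    (: unwrap a glossary marker only when it sits inside a term of art :)
    case element(gloss) return
      if ($in-toa)
      then $n/node() ! local:strip(., $in-toa)
      else element gloss { $n/@*, $n/node() ! local:strip(., $in-toa) }
    (: copy any other element and recurse into its children :)
    case element() return
      element { node-name($n) } { $n/@*, $n/node() ! local:strip(., $in-toa) }
    (: text, comments and processing instructions pass through unchanged :)
    default return $n
};

local:strip(
  <para>The <toa>standard <gloss>deviation</gloss> estimate</toa> uses a <gloss>deviation</gloss>.</para>,
  false()
)
```

With that sample input, the glossary marker inside the term of art and the term-of-art marker itself should both disappear, while the second glossary marker survives.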
On Sat, 2020-04-25 at 13:46 -0400, Graydon Saunders wrote:
I think I have figured out a way to connect the adjacent marked words in the phrasal term into a single mark element. I cannot convince myself that this is the right way; is there a better approach than tumbling windows?
I just search for the multi-word phrase and surround that. Enclosed is a sample from a prototype for a keyword-in-context search index for fromoldbooks.org (not yet live). Looking now I see it's not very neat, but maybe it'll give some ideas.
let $results := (
  let $matches := $doc//p[not(ancestor::longdesc) and (.//text() contains text { $term })]
  for $match at $pos in $matches
  let $sock := <singleton>{
        ft:mark($match[text() contains text { $term }], 'sock')
      }</singleton>,
      (: "sock" is now a singleton element likely containing a p or title element,
       : with every phrase matching the query surrounded with a sock element. :)
      $longbefore := concat(" ", local:ws(string-join(($sock//sock)[1]/preceding-sibling::node(), ''))),
      $before := replace($longbefore, " $", " "),
      $after := local:ws(string-join(($sock//sock)[last()]/following-sibling::node(), '')),
      $uri := document-uri($match/ancestor::document-node()),
      $image := $match/ancestor::image
  where not(empty(($sock//sock)))
  return
    <details id="d{$pos}" data-group="{ ($image/@source) }">
      <summary class="{if ($image) then 'image' else 'text'}{ if ($image and ($image eq $matches[$pos - 1]/ancestor::image)) then ' same ' else ''}">
        <before>{
          substring($before, string-length($before) - 100, string-length($before))
        }</before>
        <match><b>{ string-join($sock//sock, ' ') }</b><after>{ local:trunc($after, 60) }</after></match>
      </summary>
    </details>
)
return
  <results>{
    for $r in $results
    group by $g := $r/@data-group
    return
      <div class="group">
        <p class="metadata">{
          if ($g ne "") then
            let $source := $doc//source[@id = $g]
            return (
              <a href="/{$g}/">{ $source/title/node() }</a>,
              if ($source/author and ($source/author ne "Anonymous"))
              then ", by " || $source/author
              else "",
              if ($source/date)
              then " (" || $source/date || ")"
              else ""
            )
          else $r[1]//a[contains-token(@class, 'info')]
        }</p>
        {$r}
      </div>
  }</results>
On Sat, Apr 25, 2020 at 06:02:14PM -0400, Liam R. E. Quin scripsit:
I just search for the multi-word phrase and surround that. Enclosed is a sample from a prototype for a keyword-in-context search index for fromoldbooks.org (not yet live). Looking now I see it's not very neat, but maybe it'll give some ideas.
It does, but alas I can't use string-join. Some of the terms have hyphens, so I'm getting <mark>A</mark>-<mark>List</mark> coming out of the full text search, which must become <mark>A-List</mark>. Plus some of the terms have the form "nine-pence and six-pence", so any solution has to be general for interstitial text nodes.
(I can't rule out any punctuation. I know there are hyphens, but don't know there are ONLY hyphens.)
Thanks!
Graydon
Hi Graydon,
It’s a good idea to use the window clause (as the number of mark elements that need to be joined is not known in advance). You can use ft:tokenize to include other delimiters:
for $term in ('Diverse and various', 'words… some', 'glossary-terms')
for $ft in ft:mark(db:open('DB')//*[text() contains text { $term }])
return element { name($ft) } {
  $ft/@*,
  for tumbling window $w in $ft/node()
    start when true()
    end $e next $enext when (
      $enext[not(self::mark)] and $enext[exists(ft:tokenize(.))] or
      $enext[self::mark] and $e[exists(ft:tokenize(.))]
    )
  return
    if ($w[self::mark])
    then <mark>{ string-join($w) }</mark>
    else $w
}
If you don’t want to rebuild your original node, you can also use the 'update' expression and modify your existing document. I have slightly rewritten the original code, but the basic idea is the same:
for $term in ('Diverse and various', 'words… some', 'glossary-terms')
for $ft in ft:mark(db:open('DB')//*[text() contains text { $term }])
return $ft update {
  for tumbling window $w in node()
    start $s when $s/self::mark
    end $curr next $next when (
      exists(ft:tokenize($curr)) and exists($next/self::mark) or
      exists(ft:tokenize($next)) and empty($next/self::mark)
    )
  return (
    replace node head($w) with element mark { string-join($w) },
    delete nodes tail($w)
  )
}
Hope this helps, Christian
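For anyone who wants to try the window idea without a database, here is a minimal standalone sketch of the same kind of merge. It is a variation, not the code from the messages above, and it rests on two assumptions: a regular expression ("contains no letter or digit") stands in for the ft:tokenize test, and a run is closed at its last mark element so that punctuation after the run stays outside the merged mark. The helper name local:joins and the sample paragraph are made up for illustration.

```xquery
xquery version "3.1";
declare boundary-space preserve;

(: Hypothetical helper: true if $n is a text node without letters or digits
   (hyphen, space, other punctuation) that is directly followed by a mark. :)
declare function local:joins($n as node()?) as xs:boolean {
  exists(
    $n/self::text()[not(matches(., '[\p{L}\p{N}]'))]
      [following-sibling::node()[1]/self::mark]
  )
};

(: Sample input shaped like per-word marking output; boundary-space is set
   to preserve so the single spaces between mark elements are kept. :)
let $para := <para><mark>A</mark>-<mark>List</mark> guests paid <mark>nine</mark>-<mark>pence</mark> <mark>and</mark> <mark>six</mark>-<mark>pence</mark>.</para>
return element { node-name($para) } {
  $para/@*,
  for tumbling window $w in $para/node()
    start $s when true()
    (: a window that starts on a non-mark node closes at once; a window that
       starts on a mark closes at the last mark of the connected run :)
    end $e next $n when
      empty($s/self::mark) or ($e/self::mark and not(local:joins($n)))
  return
    if ($s/self::mark)
    then element mark { string-join($w ! string(.), '') }
    else $w
}
```

On this input the result should contain two merged elements, <mark>A-List</mark> and <mark>nine-pence and six-pence</mark>, with the text between them and the final full stop left unmarked. As with the queries above, two separate matches divided only by a space or punctuation would still be merged into one.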
On Sun, Apr 26, 2020 at 6:04 AM Graydon graydonish@gmail.com wrote:
It does, but alas I can't use string-join. Some of the terms have hyphens, so I'm getting <mark>A</mark>-<mark>List</mark> coming out of the full text search, which must become <mark>A-List</mark>. Plus some of the terms have the form "nine-pence and six-pence", so any solution has to be general for interstitial text nodes.
(I can't rule out any punctuation. I know there are hyphens, but don't know there are ONLY hyphens.)
Thanks!
Graydon
Hi Christian --
Thank you! That helps a lot. I can maintain that; or rather, I won't have to maintain that, since it's general enough to keep working.
Much appreciated! Graydon
On Sun, Apr 26, 2020 at 4:56 AM Christian Grün christian.gruen@gmail.com wrote:
Hi Graydon,
It’s a good idea to use the window clause (as the number of mark elements that need to be joined is not known in advance). You can use ft:tokenize to include other delimiters:
for $term in ('Diverse and various', 'words… some', 'glossary-terms')
for $ft in ft:mark(db:open('DB')//*[text() contains text { $term }])
return element { name($ft) } {
  $ft/@*,
  for tumbling window $w in $ft/node()
    start when true()
    end $e next $enext when (
      $enext[not(self::mark)] and $enext[exists(ft:tokenize(.))] or
      $enext[self::mark] and $e[exists(ft:tokenize(.))]
    )
  return
    if ($w[self::mark])
    then <mark>{ string-join($w) }</mark>
    else $w
}
If you don’t want to rebuild your original node, you can also use the 'update' expression and modify your existing document. I have slightly rewritten the original code, but the basic idea is the same:
for $term in ('Diverse and various', 'words… some', 'glossary-terms')
for $ft in ft:mark(db:open('DB')//*[text() contains text { $term }])
return $ft update {
  for tumbling window $w in node()
    start $s when $s/self::mark
    end $curr next $next when (
      exists(ft:tokenize($curr)) and exists($next/self::mark) or
      exists(ft:tokenize($next)) and empty($next/self::mark)
    )
  return (
    replace node head($w) with element mark { string-join($w) },
    delete nodes tail($w)
  )
}
Hope this helps, Christian
basex-talk@mailman.uni-konstanz.de