Maximilian,
That’s exactly the solution I arrived at:
To create a where-used table over all the topics in my corpus, I process each DITA map or topic that refers to a topic and, for each reference, construct a map entry that maps the target to the reference. The key is the URI of the target document; the value is a map with an entry for the target topic and an entry for the list of references to that topic. (DITA maps are simply collections of links to topics, other DITA maps, or non-DITA resources, while topics may have
cross references (xref) or content references (conref) to other topics.)
That results in a map where each entry’s value is a sequence of maps, each with one item in its “refs” field. I then iterate over the where-used map to replace each entry’s sequence of maps with a single
map that has one entry for each kind of reference:
let $pointsToMeMap := map:merge(
  for $key in map:keys($baseMap)
  let $entry := $baseMap($key)
  let $newEntry :=
    map {
      'topic'     : $entry?topic,
      'topicrefs' : $entry?topicrefs,
      'xrefs'     : $entry?xrefs,
      'conrefs'   : $entry?conrefs
    }
  return map { $key : $newEntry }
)
That process takes about 45 seconds for my corpus, which is pretty good (each kind of reference takes about 15 seconds to collect, so about 45 seconds for all three reference types).
For the references I’m doing a proper resolution of each reference to its target document, so the result is 100% correct, as opposed to my earlier approach, which depended on filenames being unique. (Unique filenames are supposed to be our local policy, but they are definitely not unique in my corpus, and uniqueness is certainly not a DITA requirement.)
Obviously, lookup in this where-used map will be very fast (I haven’t yet had a chance to measure using the map to do things like construct a link graph extending from some starting topic).
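For what it’s worth, a single lookup is just a map access; this is a sketch assuming the entry layout built above, with a hypothetical starting URI:

```xquery
(: Sketch: look up everything that refers to one topic.
   $start-uri is a hypothetical target-document URI;
   $pointsToMeMap is the map constructed above. :)
let $start-uri := 'topics/some-topic.dita'
let $entry := $pointsToMeMap($start-uri)
return (
  $entry?topicrefs, (: topicrefs in maps pointing at this topic :)
  $entry?xrefs,     (: cross references to it :)
  $entry?conrefs    (: content references to it :)
)
```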
My next challenge is how best to persist this map in my database so I can run multiple ad-hoc queries from the BaseX GUI without having to rebuild the table.
Doing this through a more persistent application would be the normal solution, but for now I’m just doing ad-hoc queries and don’t have the scope to build a more complete application.
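One ad-hoc way to persist it might be to serialize the map to an XML document and store it back into the database. This is only a sketch: it assumes each reference can be reduced to the URI of its source document, and the document name and element names are invented for illustration.

```xquery
(: Sketch: persist the where-used map as an XML document in the database.
   Assumes $pointsToMeMap from above, and that each reference has been
   reduced to its source-document URI; 'my-corpus', 'where-used.xml',
   and the element names are invented for illustration. :)
let $doc :=
  <where-used>{
    for $key in map:keys($pointsToMeMap)
    let $entry := $pointsToMeMap($key)
    return
      <target uri="{$key}">{
        for $uri in $entry?topicrefs return <topicref uri="{$uri}"/>,
        for $uri in $entry?xrefs     return <xref uri="{$uri}"/>,
        for $uri in $entry?conrefs   return <conref uri="{$uri}"/>
      }</target>
  }</where-used>
return db:replace('my-corpus', 'where-used.xml', $doc)
```

A later GUI query could then rebuild the map from db:open('my-corpus', 'where-used.xml') instead of re-resolving every reference.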
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
LinkedIn | Twitter | YouTube | Facebook
From:
BaseX-Talk <basex-talk-bounces@mailman.uni-konstanz.de> on behalf of Maximilian Gärber <mgaerber@arcor.de>
Date: Tuesday, January 18, 2022 at 6:31 AM
To: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Where-Used: Performance Improvement Strategies?
Hi Eliot,
in similar cases, I've learned that building temporary maps is really fast.
So, instead of doing the retrieval and filtering in one step, I just
construct a map with a convenient key.
In this example, I want a list of categories for articles that can
exist in multiple sections (of a web site).
In a later step, I simply consult the map for the categories.
let $category-map := map:merge(
  for $a in $all-sections//ProductItem
  let $guid := $a/@Guid
  group by $guid
  return map:entry($guid,
    <categories>{
      let $cats :=
        for $s in $a/parent::*/parent::Section
        return $s/ShopCategoryId/text()
      for $cat in distinct-values($cats)
      return <_><id>{$cat}</id></_>
    }</categories>
  )
)
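The later lookup is then a single map access, e.g. (a sketch; $article stands for some ProductItem already in hand):

```xquery
(: Sketch: consult the map for one article's categories.
   $article is a hypothetical ProductItem element. :)
let $guid := string($article/@Guid)
return $category-map($guid)//id/text()
```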
Best,
Max
Am Fr., 14. Jan. 2022 um 16:41 Uhr schrieb Eliot Kimber
<eliot.kimber@servicenow.com>:
>
> In the context of my 40K topic DITA corpus, I’m trying to build a “where used” report that finds, for each topic, the other topics that directly refer to it. I can do this by looking for the target topic’s filename in the values of @href attributes
> in other topics (I’m taking advantage of a local rule we have that all topic filenames should be unique).
>
> My current naive approach is simply:
>
> $topics//*[tokenize(@href, '/') = $filename]
>
> Where $topics is the 40K topics.
>
> Based on profiling, the use of tokenize() is slightly faster than either matches() or contains(), but all forms take about 0.5 seconds per target topic, which is far too slow to be practical.
>
> So I’m trying to work out what my performance optimization strategies are in BaseX.
>
> In MarkLogic I would set up an index so I could do fast lookup of tokens in @href values or something similar (it’s been long enough since I had to optimize MarkLogic queries that I don’t remember the details but basically indexes for everything).
>
> I know I could do a one-time construction of the where-used table and then use that for quick lookup in subsequent queries, but I’m trying to find a solution that is more appropriate for my current “create a new database with the latest files from git and
> run some queries quickly to get a report” mode.
>
> I suspect that using full-text indexing may be a solution here but wondering what other performance optimization options I have for this kind of look up.
>
> Thinking about it now, I definitely need to see whether building the where-used table would actually be slower. That is, find every @href, resolve it, and construct a map from topics to the href elements that point to each topic. Hmm.
>
> Anyway, any guidance on this challenge would be appreciated.
>
> Cheers,
>
> Eliot
>
> _____________________________________________
>
> Eliot Kimber
>
> Sr Staff Content Engineer
>
> O: 512 554 9368
>
> M: 512 554 9368
>
> servicenow.com
>
> LinkedIn | Twitter | YouTube | Facebook