Thanks, Bridger--that's very helpful! I'm not sure what MarkLogic is using exactly, but it seems fairly sophisticated (there's even an advanced option for multiple stemming: e.g., "further" has "far," "farther," "further" as stems).

All best,
Tim


--
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library



On Wed, Apr 13, 2022 at 12:13 PM Bridger Dyson-Smith <bdysonsmith@gmail.com> wrote:
Hi Tim -

On Wed, Apr 13, 2022 at 11:40 AM Tim Thompson <timathom@gmail.com> wrote:
I'm currently involved in a project that's using MarkLogic, and I noticed that its implementation of English-language stemming differs from that of BaseX: e.g., "mouse" and "mice" both stem to "mouse."

In BaseX, those words are stemmed separately. Is this a known limitation of the internal English syntax parser?

It's my (admittedly *very* limited) understanding that the BaseX stemmer, at least for English, is limited to the Porter stemmer [1]. The Porter stemmer only strips suffixes by rule, so it doesn't map apophonic plurals such as "mice" to their singular forms.
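
A quick way to check this might be to look at the stems directly. This is only a minimal sketch: it assumes that `ft:tokenize()` accepts the same "stemming" and "language" options as the index, and the word list is just an illustration. If the stemmer really is suffix-based, I'd expect a regular pair like "cats"/"cat" to end up with the same stem, while "mouse" and "mice" stay distinct.

```xquery
(: Print the stem BaseX produces for each word.
   Assumption: ft:tokenize() honors the "stemming" and "language" options. :)
for $word in ("mouse", "mice", "cats", "cat")
return $word || " -> " || string-join(
  ft:tokenize($word, map { "stemming": true(), "language": "en" }),
  " "
)
```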

It'd be interesting to learn what stemmer(s) MarkLogic uses.

And, while I'm not that familiar with it (and it would probably entail significant work to implement), the `ft:thesaurus()` function provides similar functionality:
```
ft:thesaurus(
  <thesaurus>
    <entry>
      <term>mice</term>
      <synonym>
        <term>mouse</term>
        <relationship>NT</relationship>
      </synonym>
      <synonym>
        <term>rodent</term>
        <relationship>BTG</relationship>
      </synonym>
    </entry>
  </thesaurus>,
  'mice'
)
```
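
One way the lookup could be put to use, as an untested sketch: take the terms returned by `ft:thesaurus()` and add them to the original search term. The `$thesaurus`, `$terms`, and `$data` variables are only illustrative names, and I'm assuming the function returns the synonym terms as plain strings when no options are passed.

```xquery
(: Expand a query term with thesaurus entries, then match any of the terms.
   Assumption: ft:thesaurus() returns the synonym terms as strings. :)
let $thesaurus :=
  <thesaurus>
    <entry>
      <term>mice</term>
      <synonym>
        <term>mouse</term>
        <relationship>NT</relationship>
      </synonym>
    </entry>
  </thesaurus>
let $terms := ("mice", ft:thesaurus($thesaurus, "mice"))
let $data := <data><x>mouse</x><y>mice</y></data>
return $data/*[text() contains text { $terms } any]
```

This runs against an in-memory node, so it doesn't need a database or a full-text index.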
 
Example:

```
db:create("stem-test",
  <data>
    <x>mouse</x>
    <y>mice</y>
  </data>
  , "data", map {"ftindex": true(), "stemming": true(), "language": "en"}
)
,
update:output(
  ft:search("stem-test", "mice")
)
```


Thanks,
Tim



Best,
Bridger

[1]  https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa51/basex-core/src/main/java/org/basex/util/ft/EnglishStemmer.java

