 
Hi Bridger,
Thank you for this tip. It looks like it might apply only to adding new documents, whereas my main problem at the moment is reindexing existing documents, but I will look into it further.
Thanks, Greg
From: Bridger Dyson-Smith <bdysonsmith@gmail.com>
Date: Thursday, March 14, 2024 at 6:43 PM
To: Murray, Gregory <gregory.murray@ptsem.edu>
Cc: Christian Grün <christian.gruen@gmail.com>, basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Out of Main Memory
Hi Greg,
Have you tried experimenting with the ADDCACHE[1] option when building your database? While it's been a bit, I recall having good results with it, especially in a RAM-constrained environment. Hope that's helpful!
Best, Bridger
[1] https://docs.basex.org/wiki/Options#ADDCACHE
On Thu, Mar 14, 2024 at 9:55 PM Murray, Gregory <gregory.murray@ptsem.edu> wrote:
Thanks, Christian. I don’t think selective indexing is applicable in my use case, because I need to perform full-text searches on the entirety of each document. Each XML document represents a physical book that was digitized, and the structure of each document is essentially a header with metadata and a body with the OCR text of the book. The OCR text is split into pages, where one <page> element contains all the words from one corresponding printed page from the physical book. Obviously the number of words in each <page> varies widely based on the physical dimensions of the book and the typeface.
So far, I have loaded 12,331 documents, containing a total of 2,196,771 pages. The total size of those XML documents on disk is 4.7GB. But that is only a fraction of the total number of documents I want to load into BaseX. The total number is more like 160,000 documents. Assuming that the documents I’ve loaded so far are a representative sample, and I believe that’s true, then the total size of the XML documents on disk, prior to loading them into BaseX, would be about 4.7GB * 13 = 61.1GB.
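The estimate above can be checked with a short calculation. This is only a minimal sketch in Python: the variable names are illustrative, and it assumes, as stated, that the documents loaded so far are a representative sample of the full collection.

```python
# Figures taken from the message above.
loaded_docs = 12_331        # documents loaded so far
loaded_pages = 2_196_771    # <page> elements in those documents
loaded_size_gb = 4.7        # size of those XML files on disk
target_docs = 160_000       # approximate size of the full collection

# Scale factor from the loaded sample to the full collection (roughly 13).
scale = target_docs / loaded_docs

# Projected totals, assuming the sample is representative.
estimated_size_gb = loaded_size_gb * scale
estimated_pages = loaded_pages * scale
pages_per_doc = loaded_pages / loaded_docs

print(f"scale factor:      {scale:.2f}")
print(f"estimated size:    {estimated_size_gb:.1f} GB")
print(f"estimated pages:   {estimated_pages:,.0f}")
print(f"pages per document: {pages_per_doc:.0f}")
```

Using the unrounded scale factor gives about 61 GB, in line with the 4.7GB * 13 figure.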
Normally the OCR text, once loaded, almost never changes. But the metadata fields do change as corrections are made. Also we add more XML documents routinely as we digitize more books over time. Therefore updates and additions are commonplace, such that keeping indexes up to date is important, to allow full-text searches to stay performant. I’m wondering if there are techniques for optimizing such quantities of text.
Thanks, Greg
From: Christian Grün <christian.gruen@gmail.com>
Date: Thursday, March 14, 2024 at 8:48 AM
To: Murray, Gregory <gregory.murray@ptsem.edu>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Out of Main Memory
Hi Greg,
A quick reply: If only parts of your documents are relevant for full-text queries, you can restrict the selection with the FTINDEX option (see [1] for more information).
How large is the total size of your input documents?
Best, Christian
[1] https://docs.basex.org/wiki/Indexes#Selective_Indexing
On Tue, Mar 12, 2024 at 8:34 PM Murray, Gregory <gregory.murray@ptsem.edu> wrote:
Hello,
I’m working with a database that has a full-text index. I have found that if I iteratively add XML documents, then optimize, add more documents, optimize again, and so on, eventually the “optimize” command will fail with “Out of Main Memory.” I edited the basex startup script to change the memory allocation from -Xmx2g to -Xmx12g. My computer has 16 GB of memory, but of course the OS uses up some of it. I have found that if I exit memory-hungry programs (web browser, Oxygen), start basex, and then run the “optimize” command, I still get “Out of Main Memory.” I’m wondering if there are any known workarounds or strategies for this situation. If I understand the documentation about indexes correctly, index data is periodically written to disk during optimization. Does this mean that running optimize again will pick up where the previous attempt left off, such that running optimize repeatedly will eventually succeed?
Thanks, Greg
Gregory Murray
Director of Digital Initiatives
Wright Library
Princeton Theological Seminary