Hi Vincent,
My question is: what is an optimal maximum database size (in gigabytes)? I am hoping for some general advice. I appreciate that the answer can vary depending on various factors.
Thanks for the precise details on your input data. Indeed it’s difficult to give general advice on the best size of a database, because it largely depends on the structure and contents of your documents (see e.g. [1]). In our own applications, single databases are rarely larger than 10 GB, but I know that some people are storing more than 100 GB in single instances.
- Running OPTIMIZE or OPTIMIZE ALL on the larger databases results in an out-of-memory error. I have switched to running CREATE INDEX to create the text, attribute, token, and fulltext indexes separately, and found that creating these indexes separately produces fewer out-of-memory errors.
I would be interested in hearing which index structure causes the most problems in your scenario. Do you use the full-text index? The value index structures should usually work fine with much less main memory assigned.
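Just as a sketch of what I mean (the database name is only an example; adjust it and the list of indexes to your setup), you could create the value indexes first and the full-text index last, so it becomes obvious which step actually runs out of memory:

  # open the existing database (name is just a placeholder)
  OPEN mydb
  # value indexes: these usually get by with little main memory
  CREATE INDEX text
  CREATE INDEX attribute
  CREATE INDEX token
  # the full-text index is typically the most memory-hungry step
  CREATE INDEX fulltext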
- The command scripts that run CREATE INDEX or OPTIMIZE (ALL) seem to tie up the machine for a long time, maybe due to heavy disk access.
This is probably something that cannot be avoided. Did you try incremental indexing (via the UPDINDEX option)?
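As a small sketch (names and paths are placeholders), UPDINDEX is set before the database is created, and the value indexes are then maintained with each update:

  # enable incremental index updates before creating the database
  SET UPDINDEX true
  CREATE DB mydb /path/to/input
  # value indexes are now maintained on each ADD/REPLACE;
  # the full-text index still needs an explicit CREATE INDEX fulltext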
- As the database grows in size, the rate at which documents are added slows down. I have been measuring the number of documents imported, and have observed rates of over 100 documents per minute, but typical rates are around 60–30 documents per minute.
I saw that you use REPLACE instead of ADD. If you know that your added documents will be distinct, ADD will be much faster. You could additionally set AUTOFLUSH to false and do explicit flushes at regular intervals (or avoid flushing completely if robustness of your data is not your first priority).
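Something along these lines, as a rough sketch (database name and paths are placeholders for your own import script):

  OPEN mydb
  # delay writes to disk; remember to flush explicitly
  SET AUTOFLUSH false
  # ADD skips the lookup for existing documents, so it is faster than REPLACE
  ADD TO docs/doc1.xml /local/import/doc1.xml
  ADD TO docs/doc2.xml /local/import/doc2.xml
  # write pending changes to disk every now and then
  FLUSH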
Hope this helps,
Christian