Hello Bram,
IMHO the main argument for data/index separation is that the index is easy to recreate, and that re-indexing your index database is just as easy. Is there still a need for ad hoc indexing, now that BaseX lets us restrict indexing to a selection of node names? I guess you need to index computed values?
As for the current BaseX limitations, you will find them in [1], but you may have already read that page. I once hit the limit on the number of nodes per database while working with the European Patent Office DOCDB collection, so I had to set up a database naming policy to distribute the documents across several databases.
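As an illustration of such a naming policy, here is a minimal Python sketch. It is not the actual setup used for DOCDB: the function name `assign_databases`, the `docdb` prefix, and the numbered database names are all hypothetical, and the node limit is passed in as a parameter rather than hard-coded, so the real value should be taken from [1]. It assumes you already know (or can estimate) the node count of each document, and it simply starts a new database whenever the next document would push the current one over the limit.

```python
from typing import Iterable, Iterator, Tuple


def assign_databases(
    docs: Iterable[Tuple[str, int]],
    max_nodes: int,
    prefix: str = "docdb",  # hypothetical database name prefix
) -> Iterator[Tuple[str, str]]:
    """Yield (database_name, document_name) pairs.

    docs is an iterable of (document_name, node_count) pairs.
    Documents are packed in input order; a new numbered database is
    started whenever adding the next document would exceed max_nodes.
    """
    index = 0  # number of the database currently being filled
    used = 0   # nodes already assigned to that database
    for name, nodes in docs:
        if nodes > max_nodes:
            # A single document larger than the limit cannot be placed
            # anywhere; it would have to be split before import.
            raise ValueError(f"{name} alone exceeds the node limit")
        if used + nodes > max_nodes:
            index += 1
            used = 0
        used += nodes
        yield f"{prefix}_{index:03d}", name


if __name__ == "__main__":
    # Toy numbers only; real node counts and the real limit are far larger.
    sample = [("a.xml", 6), ("b.xml", 5), ("c.xml", 4)]
    for database, document in assign_databases(sample, max_nodes=10):
        print(database, document)
```

The resulting pairs can then drive whatever import commands you use. Because the database names are numbered predictably, a query can later iterate over all of them by prefix.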
Hope this helps,
Best regards,
Fabrice Etanchaud Senior Data Specialist CERFrance PCH
[1] http://docs.basex.org/wiki/Statistics
From: BaseX-Talk [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Bram Vanroy
Sent: Thursday, 14 June 2018 10:47
To: BaseX
Subject: [basex-talk] Usage of doc's in BaseX
Dear BaseX team
I am planning an update of our previous custom indexing system [1]. But to do this I have a couple of questions. The major ones will be how to write an efficient custom indexing query in XQuery, but that'll be for another email. (In fact, we have a dual indexing system, so two index files per main file.) For now I am mainly interested in different documents in a single database, and the doc() functionality.
Intuitively, I'd say that documents that are related to each other should be put in the same database. E.g. one database with different documents for plants, and one database with different documents for animals. But when I was scrolling through the BaseX documentation, I noticed that when creating custom indices you do not put those in the same db as the original content, so you have one database for the content and one for the index [2]. Is this the way it's typically done?
More generally, the questions that I have are the following:
* What is the actual difference in BaseX between using separate documents in a single database, or using different databases altogether?
* Is there a performance difference when I would put my index file in the same database as the content, vs. when using different databases altogether?
* What is the maximum allowed size for a document in a database, and for a database itself? (I have files that are hundreds of GB in size. It might not be feasible to have a file and its index file in the same database.)
Thank you in advance,
Kind regards
Bram Vanroy Doctoral Research at Ghent University, Belgium https://www.lt3.ugent.be/people/bram-vanroy/
[1] https://biblio.ugent.be/publication/8534144
[2] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures