Hello BaseX list,
I'm completely new to BaseX and a bit overwhelmed by the resources I have found so far in the wiki, so please forgive a novice's request for advice.
My question: Is BaseX capable of handling TEI-XML files under the following circumstances?
- number of TEI files: ~10^7
- number of directories in which these files are stored: ~10^5
- number of words in TEI/body to be indexed: ~5*10^9
- yearly increment: ~10^9 words in about 10^6 files
The main concern is full-text search within TEI/body, which must be performant: users interact with the database by searching the full text.
Indexing this amount of data should be achievable in reasonable time, say:
- initial indexing may take some days, if necessary
- incremental(?) indexing of new data should be an overnight job
Can I give BaseX a try? Or should I look elsewhere?
Cheers, Matthias
Hi Matthias,
> Can I give BaseX a try?
You definitely should ;) Maybe you can simply start off, download BaseX and import your TEI directories. Some database limits are listed here [1]. If you encounter problems with creating the full-text index for your XML data, documents can also be split across multiple databases.
What’s the total file size of your initial TEI documents?
Best, Christian
[1] https://docs.basex.org/wiki/Statistics
(In reply to Matthias Schütze, Thu, Sep 3, 2020, 7:05 PM)
Hi Christian,
We're talking about ~150 GB for the initial TEI docs.
Well, with this promising answer, I'll go ahead. We'll meet again :-)
Matthias
(In reply to Christian Grün, Thursday, 3 September 2020, 19:10:54 CEST)
Hello list, hello Christian,
since I "definitely should" build a BaseX database from millions of TEI-XML files, I did so! My first one consists of about 3.8 million files in roughly 25 GB.
Creating this first database took about 70 minutes, including full-text index. Searching for "Konstanz" in this dataset yields 6200 hits in 400ms.
Wow, quite impressive! Really.
BTW, this is the corresponding XQuery I tried:
  declare variable $b := 'Konstanz';
  for $t in collection("Korpus01")//*[./text() contains text {$b}]
  return <p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>
OK, this is promising indeed. So I tried to reach my next goal: 10 million files, ~70 GB of disk space. Bad luck: creating the database fails because there is not enough memory while building the full-text index. Since memory is limited, I did not try to increase the Java memory option any further (it is currently "-Xmx3g"). Instead I went the other way: creating additional databases. For each of them, this was as fast as in the first step. BaseX is fun...
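The split into several databases can also be scripted. The following is a minimal sketch, assuming the TEI files are already partitioned into one directory per database. The directory paths and the `$parts` map are hypothetical placeholders; `db:create` with the `ftindex` option comes from the BaseX Database Module.

```xquery
(: Sketch: one database per source directory, each with its own
   full-text index. The directory paths are hypothetical placeholders. :)
declare variable $parts := map {
  'Korpus01': '/data/tei/part01/',
  'Korpus02': '/data/tei/part02/'
};

for $name in map:keys($parts)
return db:create(
  $name,                      (: database name :)
  $parts($name),              (: input directory :)
  (),                         (: no explicit target paths :)
  map { 'ftindex': true() }   (: build the full-text index on creation :)
)
```

All creations here belong to one updating query. Whether running one `db:create` per query is gentler on memory has not been verified here and is only an assumption.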
But now, at this point, the hurdles are too high, at least for me. According to https://docs.basex.org/wiki/Databases#Access_Resources [1], I modified the XQuery:
  declare variable $b := 'Konstanz';
  for $c in ('Korpus01', 'Korpus02')
  for $t in collection($c)//*[./text() contains text {$b}]
  return <p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>
This gives results, but takes orders of magnitude longer than for just one database: 14000 hits in 690000 ms.
What's wrong with my approach: the XQuery I applied? Or my expectation of getting comparably fast results from full-text searches across multiple databases?
Thanks again Matthias
-------- [1] https://docs.basex.org/wiki/Databases#Access_Resources
Hi Matthias,
> since I "definitely should" build a BaseX database from millions of TEI-XML files, I did so!
Glad to hear!
> I modified the XQuery: ... gives results, but lasts orders of magnitude longer than for just one database:
If a query is run on a single database, this database will be opened at compile-time, and available indexes will be checked. If the full-text index exists, your query will be rewritten to take advantage of the index structure.
If multiple databases are accessed in an iteration, you can e.g. give the query optimizer a hint that all databases will have up-to-date index structures. This can be done with the “enforceindex” pragma [1]:
  declare variable $b := 'Konstanz';
  for $c in ('Korpus01', 'Korpus02')
  for $t in (# db:enforceindex #) {
    db:open($c)//*[./text() contains text {$b}]
  }
  return <p>{ ft:extract($t[./text() contains text {$b}]/text(), 'b', 155) }</p>
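Because the pragma asserts that every database has an up-to-date index, it may be worth checking that first. The following is a minimal sketch, assuming that the `db:info` output contains `ftindex` and `uptodate` elements. These element names are an assumption about the output format and should be compared with what your BaseX version returns.

```xquery
(: Sketch: report, per database, whether a full-text index exists and
   whether the index structures are flagged as up to date.
   The element names ftindex and uptodate are assumed from db:info. :)
for $c in ('Korpus01', 'Korpus02')
let $info := db:info($c)
return <db name="{ $c }"
           ftindex="{ $info//ftindex }"
           uptodate="{ $info//uptodate }"/>
```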
If you use the BaseX GUI, you can open the Info View and check the output. If it outputs “apply full-text index”, you’ll know that the index is utilized. In the Info View, you’ll also see the optimized query string. It will give you some hints about which other optimizations were applied to your input query.
If full-text queries get more complex, it’s sometimes more convenient to directly use ft:search [2], as this function allows you to specify variable arguments, e.g. for wildcard or fuzzy searches.
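As an illustration of such variable arguments, here is a minimal sketch of a wildcard search with ft:search across both databases. The search term 'Konstan.*' and the `hit` wrapper element are made up for this example; the `wildcards` and `fuzzy` options are the ones documented for ft:search in the Full-Text Module.

```xquery
(: Sketch: index-based search with options, run on each database in turn. :)
declare variable $dbs := ('Korpus01', 'Korpus02');

for $db in $dbs
(: wildcard search: '.*' stands for any number of characters;
   for a fuzzy search, pass map { 'fuzzy': true() } instead :)
for $hit in ft:search($db, 'Konstan.*', map { 'wildcards': true() })
return <hit db="{ $db }">{ string($hit) }</hit>
```

ft:search returns the matching text nodes, so the sketch simply wraps the string value of each hit and records which database it came from.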
Hope this helps, Christian
[1] https://docs.basex.org/wiki/Indexes#Enforce_Rewritings
[2] https://docs.basex.org/wiki/Full-Text_Module#ft:search
Hi Christian,
thanks for pointing to "ft:search", that's much easier to understand for me than using the enforceindex pragma (which yielded 0 matches, btw).
I've ended up with something like this:
  declare variable $b := 'Konstanz';
  for $c in ('Korpus01', 'Korpus02')
  for $t in ft:search($c, $b)/parent::*
  return <p>{ ft:extract($t[./text() contains text {$b}]/text(), 'b', 155) }</p>
Searching multiple databases in parallel - 19000 hits in 840ms - very nice!
Thanks again for your patient help Matthias
(In reply to Christian Grün, Thursday, 10 September 2020, 08:30:37 CEST)
basex-talk@mailman.uni-konstanz.de