big data performance

Matthias Schütze

3 Sep 2020

1:05 p.m.

Hello BaseX list,

I'm completely new to BaseX and a bit overwhelmed by the resources I have found so far in the wiki. So please forgive a novice asking for advice.

My question: Is BaseX capable of handling TEI-XML files under the following circumstances?
- number of TEI files: ~10^7
- number of directories in which these files are stored: ~10^5
- number of words in TEI/body to be indexed: ~5*10^9
- yearly increment: 10^9 words in about 10^6 files

The main concern is full-text search within TEI/body, which must be fast: users interact with the database by searching the full text.

Indexing the aforementioned amount of data should be achievable in reasonable time, say:
- initial indexing may last some days, if necessary
- incremental(?) indexing of new data should be an overnight job

Can I give BaseX a try? Or should I look elsewhere?

Cheers, Matthias


Christian Grün

3 Sep 2020

1:10 p.m.

Hi Matthias,

> Can I give BaseX a try?

You definitely should ;) Maybe you can simply start off, download BaseX and import your TEI directories. Some database limits are listed here [1]. If you encounter problems with creating the full-text index for your XML data, documents can also be split across multiple databases.

What’s the total file size of your initial TEI documents?

Best, Christian

[1] https://docs.basex.org/wiki/Statistics


Matthias Schütze

1:25 p.m.

Hi Christian,

we're talking about ~150 GB for the initial TEI docs.

Well, with this promising answer, I'll go ahead. We'll meet again :-)

Matthias


Matthias Schütze

9 Sep 2020

2:28 p.m.

Hello list, hello Christian,

since I "definitely should" build a BaseX database from millions of TEI-XML files, I did so! My first one consists of about 3.8 million files, roughly 25 GB in total.

Creating this first database took about 70 minutes, including the full-text index. Searching for "Konstanz" in this dataset yields 6200 hits in 400 ms.

Wow, quite impressive! Really.

BTW, this is the corresponding XQuery I tried:

    declare variable $b := 'Konstanz';
    for $t in collection("Korpus01")//*[./text() contains text {$b}]
    return ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)

OK, this is promising indeed. So I tried to meet my next goal: 10 million files, ~70 GB of disk space. Bad luck: creating the database fails because there is not enough memory while building the full-text index. Since memory is limited, I did not try to increase the Java heap option further (it is currently "-Xmx3g"). Instead, I tried it the other way round: creating additional databases. For each of them, this was as fast as in the first step. BaseX is fun...
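The split into several databases described above could be scripted roughly as follows. This is only a sketch: the directory paths and the two-way split are hypothetical, and it assumes the TEI files have already been distributed over separate directories. It uses db:create with the ftindex option so that each database gets its own full-text index.

```xquery
(: Sketch only: the paths below are hypothetical placeholders. :)
let $parts := map {
  'Korpus01': '/data/tei/part01',
  'Korpus02': '/data/tei/part02'
}
for $name in map:keys($parts)
return db:create(
  $name,                      (: database name :)
  $parts($name),              (: directory with the TEI files :)
  (),                         (: no explicit target paths :)
  map { 'ftindex': true() }   (: build the full-text index on creation :)
)
```

Keeping each part at about the size of the first database (roughly 25 GB) is the idea here, since that size was built successfully with the 3 GB heap.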

But now, at this point, the hurdles are too high, at least for me. According to [1], I modified the XQuery:

    declare variable $b := 'Konstanz';
    for $c in ('Korpus01', 'Korpus02')
    for $t in collection($c)//*[./text() contains text {$b}]
    return ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)

This returns results, but takes orders of magnitude longer than for just one database: 14000 hits in 690000 ms.

What's wrong with my approach: the XQuery I applied? Or my expectation of getting comparably fast results for full-text searches across multiple databases?

Thanks again Matthias


[1] https://docs.basex.org/wiki/Databases#Access_Resources

Christian Grün

10 Sep 2020

2:30 a.m.

Hi Matthias,

> since I "definitely should" build a BaseX database from millions of TEI-XML files, I did so!

Glad to hear!

> I modified the XQuery: ... gives results, but lasts orders of magnitude longer than for just one database:

If a query is run on a single database, this database will be opened at compile-time, and available indexes will be checked. If the full-text index exists, your query will be rewritten to take advantage of the index structure.
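Before relying on the index rewriting, it may help to confirm that each database really has a full-text index and that its index structures are up to date. The following is a minimal sketch using db:property; the database names are the ones from this thread, and the property names 'ftindex' and 'uptodate' are assumed to match the entries listed by db:info.

```xquery
(: Sketch: report the full-text index status of each database. :)
for $db in ('Korpus01', 'Korpus02')
return $db
  || ': ftindex=' || db:property($db, 'ftindex')
  || ', uptodate=' || db:property($db, 'uptodate')
```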

If multiple databases are accessed in an iteration, you can, for example, give the query optimizer a hint that all databases will have up-to-date index structures. This can be done with the “enforceindex” pragma [1]:

    declare variable $b := 'Konstanz';
    for $c in ('Korpus01', 'Korpus02')
    for $t in (# db:enforceindex #) {
      db:open($c)//*[./text() contains text {$b}]
    }
    return ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)

If you use the BaseX GUI, you can open the Info View and check the output. If it outputs “apply full-text index”, you’ll know that the index is utilized. In the Info View, you’ll also see the optimized query string. It will give you some hints about which other optimizations were applied to your input query. If full-text queries get more complex, it’s sometimes more convenient to use ft:search [2] directly, as this function allows you to specify variable arguments, e.g. for wildcard or fuzzy searches.

Hope this helps, Christian

[1] https://docs.basex.org/wiki/Indexes#Enforce_Rewritings
[2] https://docs.basex.org/wiki/Full-Text_Module#ft:search
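As an illustration of the variable arguments mentioned above, here is a hedged sketch of a wildcard search with ft:search over both databases. The search pattern 'Konstan.*' is only an example chosen for this sketch; the options map with the 'wildcards' key follows the Full-Text Module documentation [2].

```xquery
(: Sketch: wildcard search across several databases via the full-text index. :)
declare variable $dbs := ('Korpus01', 'Korpus02');

for $db in $dbs
(: ft:search returns the matching text nodes from the index :)
for $text in ft:search($db, 'Konstan.*', map { 'wildcards': true() })
(: step up to the element that contains the matching text :)
return $text/parent::*
```

Replacing the options map with map { 'fuzzy': true() } and passing a plain term would request a fuzzy search instead. Because ft:search always goes through the index, it requires each database to have a full-text index.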

Matthias Schütze

9:11 a.m.

Hi Christian,

thanks for pointing to "ft:search", that's much easier to understand for me than using the enforceindex pragma (which yielded 0 matches, btw).

I ended up with something like:

    declare variable $b := 'Konstanz';
    for $c in ('Korpus01', 'Korpus02')
    for $t in ft:search($c, $b)/parent::*
    return ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)

Searching multiple databases in parallel - 19000 hits in 840ms - very nice!

Thanks again for your patient help,
Matthias


Christian Grün

9:18 a.m.

> thanks for pointing to "ft:search", that's much easier to understand for me than using the enforceindex pragma (which yielded 0 matches, btw).

Interesting. I’ll check whether I can reproduce this. In the long term, however, I also think that ft:search will give you more flexibility.

basex-talk@mailman.uni-konstanz.de