require performance improvement suggestion for large size xml


ankit kumar

31 Mar 2015

9:38 a.m.

Hi,

I am using BaseX for my XML-based application. In the application, the user provides various data filtering criteria. After the data has passed through all the filters, the resulting data is processed for evaluation. For each filter we have created a corresponding XQuery, and the output of each filter is passed as input to the next filter for further filtering.

What I notice is that the execution time of every query except the first is very high. From the query plan I observed that an index was applied for the first query but not for the rest. This may be because the subsequent queries were fired on the data (the sequence created by the preceding query) passed on by the previous filter. This was not noticeable when the XML file was smaller, but once I increased the size of the XML to 300 MB, each query started taking minutes.

I tried the following:

I fired all the filter queries on the database to take advantage of indexing, and after every filter execution I took the intersection of the output sequences generated by each filter to keep track of the final filtered data. The problem with this solution is that the intersection query between larger sequences started taking a large amount of time.

Is there a way I can improve the performance?

Thanks,

Ankit


Dirk Kirsten

31 Mar 2015

9:50 a.m.


Hello Ankit,

I am not sure how we are supposed to help you here. Your conclusions are correct, but without any code it is impossible to give any valuable advice.

As you correctly stated, your performance will benefit if you use the index. Also, no index is used if you create data on the fly. The best solution might be to simply remove the multiple steps and write one efficient query which gives you the correct result. Or you might want to store your intermediate results in a database (which is indexed, so it can be queried quickly). After all, it all depends on your query and your problem, and there is no general rule that gives you the best performance.
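The "one efficient query" idea could look roughly like the sketch below. It is only an illustration under assumptions: the database name `large_products`, the namespace `a:b:c`, and the element and attribute names are taken from the queries posted later in this thread, while the variable `$category-ids` and its default values are hypothetical stand-ins for a user-supplied filter. Whether the optimizer actually rewrites the comparison into an index lookup has to be checked in the query plan.

```xquery
declare namespace p = "a:b:c";

(: Hypothetical filter value, not from the thread:
   the category ids selected by the user. :)
declare variable $category-ids as xs:string* external := ('c1', 'c2');

(: Both filters are predicates of a single path that starts at the
   database, so no intermediate sequence is passed between queries
   and the attribute index remains available to the optimizer. :)
db:open('large_products')/products/p:category[@id = $category-ids]/*[@catid]
```

The point of the design is that every filter step still sees database nodes rather than a sequence built on the fly, which is the case in which an index can be applied at all.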

Please also be aware that this is in no way limited to BaseX or XML databases. The same applies to relational databases: if you split data up to improve performance, you pay for it when you join the data again. Also, it is always faster to use indexes (after all, this is why they exist...). And there is no way to simply "make a SQL query faster" without seeing the actual query.

tl;dr: Please provide your actual XQuery and some sample data.

Cheers,

Dirk

-- Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
|-- Registered office: Blarerstrasse 56, 78462 Konstanz
|-- Register court: Freiburg, HRB: 708285, Managing directors:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22

ankit kumar

10:07 a.m.


large.xml https://docs.google.com/file/d/0B_pB7l14skhVRmx0czRPUGFCUjA/edit?usp=drive_web

Hi Dirk,

I am attaching sample data and the query information with this mail. The queries I am firing are given below.

First query
=====================

declare namespace p = "a:b:c";

/products/p:category/*[@catid]

Second query
======================

declare namespace p = "a:b:c";

declare variable $ext_prods external; (: the large data is bound here :)

let $prods := /products/*[@catid]
let $diffCatId := distinct-values($prods/@catid)
let $catGroup := db:open('large_products')/products/p:category[@id = $diffCatId] (: index is applied on @id :)

return $ext_prods[@catid = $catGroup/@id] (: no index is used on the variable :)

====================================================

Thanks,
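One possible way to remove the unindexed filter on `$ext_prods` is sketched below. It rests on an assumption the thread does not confirm: that the items bound to `$ext_prods` are the same elements that `/products/*[@catid]` selects in the `large_products` database, so the external variable can be replaced by a path into the database. If the bound sequence comes from somewhere else, this rewrite does not apply.

```xquery
declare namespace p = "a:b:c";

(: Distinct category ids referenced by the product elements. :)
let $cat-ids := distinct-values(
  db:open('large_products')/products/*[@catid]/@catid
)
(: Ids of the categories that actually exist for those values. :)
let $known-ids :=
  db:open('large_products')/products/p:category[@id = $cat-ids]/@id
(: Assumption: the final filter runs on database nodes instead of
   the external variable, so an index lookup on @catid is possible. :)
return db:open('large_products')/products/*[@catid = $known-ids]
```

Comparing the query plan of this version with the original should show whether the last step is now answered from the attribute index.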



basex-talk@mailman.uni-konstanz.de
