Hello,
I have been going through and comparing different native XML databases, and so far I am liking BaseX. However, a few questions remain unanswered before I make a final choice:
1. I have thousands of XML files (each between 50 MB and 400 MB), and this is going to grow quickly (~200 per day). So, my question is: how scalable is BaseX? Can I configure it to use data from my external HDD in my initial prototype?
2. I plan to heavily use XPath for data retrieval. Does BaseX use any multi-processing or multi-threading to speed up search? Any concurrent processing?
3. Can I do some post-processing on searched and retrieved data, like sorting, unique elements, etc.?
- Mansi
Dear Mansi,
- I have thousands of XML files (each between 50 MB and 400 MB), and this is going to grow quickly (~200 per day). So, my question is: how scalable is BaseX? Can I configure it to use data from my external HDD in my initial prototype?
So this means you want to add approx. 40 GB of XML files per day, amounting to roughly 14 TB per year? That sounds like quite a lot indeed. You can have a look at our statistics page [1]; it gives you some insight into the current limits of BaseX.
However, all limits are per single database. You can distribute your data in multiple databases and address multiple databases with a single XPath/XQuery request. For example, you could create a new database every day and run a query over all these databases:
for $db in db:list() return db:open($db)/path/to/your/data
- I plan to heavily use XPath for data retrieval. Does BaseX use any multi-processing or multi-threading to speed up search? Any concurrent processing?
Read-only requests will automatically be multithreaded. If a single query leads to heavy I/O, single-threaded processing may actually give you better results (because hard drives are often not very good at reading data in parallel).
- Can I do some post-processing on searched and retrieved data, like sorting, unique elements, etc.?
With XQuery (3.0), you can do virtually anything with your data. In most of our data-driven scenarios, all data processing is done completely in BaseX. Some simple examples can be found in our Wiki [2].
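For instance, a query along these lines sorts matches and removes duplicates (the 'files' database name and the @name attribute are made up for illustration):

```xquery
(: collect all distinct name attributes in a database and sort them :)
for $name in distinct-values(db:open('files')//@name)
order by $name
return $name
```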
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/XQuery_3.0
Thanks Christian.
Re: size of data: I am hoping some days will be quieter than discussed below, but yes, it is going to be a lot of data.
I just created a single database with ~190 XML files, 8.5 GB in total, and activated indexes as well. Creating the database using basexgui took close to an hour; running a simple XQuery took ~3 min. The database was created on an external USB 3.0 HDD. I will obviously be creating new databases across drives to scale it (and if this POC is successful, I will surely go the cloud route).
For the time being, any and all tips to optimize performance are welcome.
Maybe I will soon contribute to the statistics page :)
- Mansi
I just created a single database with ~190 XML files, 8.5 GB in total, and activated indexes as well. Creating the database using basexgui took close to an hour; running a simple XQuery took ~3 min. The database was created on an external USB 3.0 HDD. I will obviously be creating new databases across drives to scale it.
For the time being, any and all tips to optimize performance are welcome.
Indeed, performance should be much better if databases are created and queried on internal HDs or SSDs. Feel free to send us your queries if execution time is not good enough.
Maybe I will soon contribute to the statistics page :)
Thanks, Christian
Christian,
So, going ahead with my POC and the use cases we plan to solve, I have a few more database architecture questions:
1. Is there a way we can have a table with multiple columns? One of the columns would be "ID" and the others would be different XML information for that ID.
2. Can I map the above table to a relational table to perform join queries on "ID"?
Thanks, - Mansi
Hi Mansi,
Out of interest: why don't you simply store all documents in the database and use the document path as ID?
As BaseX is a native XML store, there is no way to store data in structures like tables. However, due to the flexibility of XML structures, the usual way is to create another document or database that contains the ID and additional metadata.
Best, Christian
On Fri, Oct 10, 2014 at 10:31 AM, Christian Grün christian.gruen@gmail.com wrote:
Hi Mansi,
Out of interest: why don't you simply store all documents in the database and use the document path as ID?
I am storing deeply nested hierarchical data in XML files. Simply put, most of my queries are going to be relative (e.g. //@name), so I am assuming it would be a huge performance hit, especially since I know each "ID" will most definitely have multiple XML documents. Correct me if I am wrong here.
As BaseX is a native XML store, there is no way to store data in structures like tables. However, due to the flexibility of XML structures, the usual way is to create another document or database that contains the ID and additional metadata.
I don't know if I follow you completely here. Is there some metadata information I can use to map each XML file stored in the NXD to the other database you discussed above?
Hi Mansi,
I am storing deeply nested hierarchical data in XML files. Simply put, most of my queries are going to be relative (e.g. //@name), so I am assuming it would be a huge performance hit, especially since I know each "ID" will most definitely have multiple XML documents. Correct me if I am wrong here.
Usually, one identifier (ID) exists to reference one data entity (document, row, etc.). You say that more than one document will be assigned to a single ID, so you seem to work with 1:n relationships. What does your ID stand for?
The notion of tables stems from the relational database world. In XML, you work with documents and collections, so it is a well-established practice to reference documents by their database path. 1:n relationships can, e.g., be represented in another database, which would contain a document with the following structure:
<docs>
  <doc id='0'>a.xml</doc>
  <doc id='0'>b.xml</doc>
</docs>
A query like the following one could be used to address these documents:
for $doc in db:open('db-ids')//doc[@id = '0'] return db:open('db', $doc)
If your ID database is indexed, access will be very fast (within fractions of a millisecond).
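To populate such a mapping database, XQuery Update can be used; a sketch, reusing the database and document names from the example above:

```xquery
(: add two document references for ID 0 to the 'db-ids' mapping database :)
insert node (
  <doc id='0'>a.xml</doc>,
  <doc id='0'>b.xml</doc>
) into db:open('db-ids')/docs
```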
I agree it takes some time to understand the logic I sketched out here, but once you are into it, it works out perfectly fine.
Hope this helps Christian
Christian,
I am absolutely new to databases in general, though I understand relational databases well enough, so I appreciate your patience in walking me through this. Is there some book/resource you can point me to that helps better visualize NXDs?
I follow your email below, but what I need to do is something like the following:
1. The database would have a bunch of XML files associated with a particular "ID". I will make sure each starts with a different root.
2. Create an XML mapping file as you discussed below.
3. In one of my use cases, I need an XPath query to return all "ID"s that match the XPath. I am not able to get ahead with this use case.
Can you help me design this use case ?
- Mansi
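Regarding use case 3: building on the mapping structure from the previous mail, a query could collect every ID whose documents match a given XPath. A sketch (the //@name predicate and the database names 'db-ids' and 'db' are assumptions):

```xquery
(: return every distinct ID for which at least one mapped document matches the XPath :)
distinct-values(
  for $doc in db:open('db-ids')//doc
  where db:open('db', string($doc))//@name = 'foo'
  return string($doc/@id)
)
```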
I am trying to distribute data across multiple databases. I can't distribute based on day, as there could very well be situations where a single day's data is more than the capacity of a BaseX database. From the statistics page, the only other way I can distribute is based on the number of nodes. But going with that, I have not found a way to access the number of nodes of a database programmatically. Further, I am clueless as to whether I can even find the number of nodes of the current document to be imported.
So, the two values I am missing are:
NodeNo(currentDocToImport = a.xml) = ??
NumberOfNodes("LastDB") = ??
Do you agree that this is even a way to go? Can someone give me pointers on how to find the above two values? Any other thoughts are always welcome...
- Mansi
Hi Mansi,
Is there some book/resource you can point me to, which helps better visualize NXD ?
Sorry for letting you wait. If you want to know more about native XML databases, I recommend you have a closer look at various articles in our Wiki (e.g. [1,2]). It will also be helpful if you get into the basics of XQuery [3].
Have you tried to realize some of the hints I gave in my previous mails?
I am trying to distribute data across multiple databases. I can't distribute based on day, as there could very well be situations where a single day's data is more than the capacity of a BaseX database.
If 2 billion XML nodes per day are not enough, you will probably need to create more than one database per day. Via the "info db" command, you can see how many nodes are currently stored in a database, but there is no cheap solution for finding out the number of nodes of an incoming document, because XML documents can be very heterogeneous. Some questions back:
* Do you have some more information on the data you want to store?
* Are all documents similar, or do they vary greatly? If the documents are somewhat similar, you can usually estimate the number of nodes by looking at the byte size.
* Do you know that you will really need to store lots of terabytes of XML data, or is it more of a theoretical assumption?
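The byte-size estimate can be scripted with the File Module; a sketch (the ratio of ~100 bytes per node is a made-up figure you would calibrate against your own existing databases):

```xquery
(: estimate the node count of an incoming file from its size on disk :)
let $bytes := file:size('a.xml')
let $bytes-per-node := 100  (: hypothetical ratio, measured on similar data :)
return xs:integer($bytes div $bytes-per-node)
```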
Christian
[1] http://docs.basex.org/wiki/Database
[2] http://docs.basex.org/wiki/Table_of_Contents
[3] http://docs.basex.org/wiki/Xquery
Christian,
Thanks for all your responses. It truly helps a lot.
Re: importing data into databases: I realized that, for the extent of this POC, I will just count the number of docs in each database (currently programmed to be 50) and keep creating new databases. The structure of the data is the same, but it is nested in nature: a folder can have folders, which can have files, etc. Usually, it won't be more than 4 levels deep. That's a good tip, to guess the number of nodes based on byte size. For the time being, I will move on with just storing 50 docs per DB.
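That rollover logic could be sketched in XQuery like this (database and file names are hypothetical; db:open, db:add and db:create are standard BaseX functions):

```xquery
(: add the incoming document to the current database, or start a new one
   once the 50-document limit is reached :)
let $current := 'db-7'  (: hypothetical name of the current database :)
return
  if (count(db:open($current)) ge 50)
  then db:create('db-8', doc('a.xml'), 'a.xml')
  else db:add($current, doc('a.xml'), 'a.xml')
```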
Re: terabytes of data: well, I am planning on using ~6 months' worth of data for any analysis and discarding data prior to that (leaving it around in backups). Obviously, I would be going the cloud route for such resources; we'll see how much budget I can manage to get :) I am very positive about this. So no, it's not only a theoretical assumption as far as I can see.
Re: querying these databases: I am exploring REST for it. From the documentation, it seems our only option is to support these queries (on the server side) using XQuery or RESTXQ, no Java/Python? I am well versed in XPath and XSLT and am gearing up towards XQuery now. But it would be a little easier (just my personal preference :)) to manipulate data in Java/Python before serving it back to the client. Is there any such facility? Something like:
"http://localhost:8984/rest?run=getData.java"
similarly for python ?
- Mansi
Some preliminary statistics: imported 2050 XML documents in 22 min (including indexing on attributes).
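As a side note on the RESTXQ option mentioned above: server-side post-processing can be written directly in XQuery as a RESTXQ function instead of Java/Python. A minimal sketch, where the URL path and the database names are assumptions (in BaseX, such a function would live in a library module in the web application directory):

```xquery
(: respond to GET /data/{id} with all documents mapped to that ID :)
declare
  %rest:GET
  %rest:path("/data/{$id}")
function local:data($id as xs:string) {
  for $doc in db:open('db-ids')//doc[@id = $id]
  return db:open('db', string($doc))
};
```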
basex-talk@mailman.uni-konstanz.de