Hi Mansi,
Good to know! Keep us updated.
Christian
On Wed, Oct 22, 2014 at 8:13 PM, Mansi Sheth mansi.sheth@gmail.com wrote:
Christian,
Actually, I am all set. I will query using Python or Angular (or whatever), do any data manipulation there, and use XQuery only for querying, with no further processing.
Btw, some initial very informal statistics:
It took 22 min to import 2050 documents (indexing on attributes included), and ~2 min for a query to return.
Impressive !!!
Machine specs: 16 GB RAM, 2.7 GHz, i7 processor MacBook Pro.
I am waiting on my colleague to get me some more production data, which will give me access to some 10k XML files. Will keep you posted.
- Mansi
On Wed, Oct 22, 2014 at 12:04 PM, Mansi Sheth mansi.sheth@gmail.com wrote:
Christian,
Thanks for all your responses. It truly helps a lot.
re: Importing data into databases: I realized that, for the extent of this POC, I will just count the number of docs in each database (currently programmed to be 50) and keep creating new databases. The structure of the data is the same, but it is nested in nature: a folder can contain a folder, which can contain a file, etc. Usually it won't be more than 4 levels deep. That's a good tip, to estimate the number of nodes based on byte size. For the time being, though, I will move on with just storing 50 docs per DB.
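(For illustration, the "50 docs per DB" scheme above could be sketched in Python as follows. This is a minimal sketch; the database name pattern "docs_0", "docs_1", ... is an assumption for illustration, not an actual naming convention from the thread.)

```python
# Sketch of the "50 documents per database" scheme discussed above.
# The database name pattern ("docs_0", "docs_1", ...) is hypothetical.

DOCS_PER_DB = 50

def target_db(doc_index):
    """Return the database name a document with this 0-based index goes into."""
    return "docs_%d" % (doc_index // DOCS_PER_DB)

# The first 50 documents land in docs_0, the next 50 in docs_1, and so on.
```

So documents 0-49 go to docs_0, document 50 starts docs_1, and a new database is created every 50 documents.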
re: terabytes of data: I am planning on using ~6 months' worth of data for any analysis and discarding data prior to that (leaving it around in backups). Obviously, I would be going some cloud route for such resources; we will see how much budget I can manage to get :) I am very positive about this. So no, it's not only a theoretical assumption as far as I can see.
re: querying: Currently, I am looking into querying these databases, and I am exploring REST for it. From the documentation, it seems our only option for supporting these queries (on the server side) is XQuery or RESTXQ, not Java/Python? I am well versed in XPath and XSLT, and am gearing up towards XQuery now. But it would be a little easier (just my personal preference :)) to manipulate data in Java/Python before serving it back to the client. Is there any such facility? Something like:
"http://localhost:8984/rest?run=getData.java"
similarly for python ?
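(For what it's worth, the plan of querying via REST and post-processing in Python could be sketched like this. A minimal sketch only: the database name "docs_0" and the XQuery expression are placeholders, and actually sending the request assumes a BaseX HTTP server running on localhost:8984.)

```python
# Sketch: evaluate an XQuery on the server via BaseX's REST API
# (GET /rest/{database}?query=...) and do any further data
# manipulation client-side in Python. "docs_0" is a placeholder name.
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed for the actual call

BASE = "http://localhost:8984/rest/docs_0"

def query_url(xquery):
    """Build the REST URL that evaluates the given XQuery on the server."""
    return BASE + "?" + urlencode({"query": xquery})

url = query_url("count(//file)")
# with urlopen(url) as resp:           # requires a running BaseX server,
#     result = resp.read().decode()    # so left commented out here
```

Any heavier processing (joins, aggregation, reshaping) can then happen in Python on `result`, keeping the server-side XQuery trivial.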
- Mansi
On Sun, Oct 19, 2014 at 6:14 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Mansi,
Is there some book/resource you can point me to which helps better visualize NXDs?
sorry for letting you wait. If you want to know more about native XML databases, I recommend having a closer look at various articles in our Wiki (e.g. [1, 2]). It will also be helpful to get into the basics of XQuery [3].
Have you tried to realize some of the hints I gave in my previous mails?
I am trying to distribute data across multiple databases. I can't distribute based on day, as there could very well be a situation where a single day's data exceeds the capacity of a BaseX DB.
If 2 billion XML nodes per day are not enough, you will probably need to create more than one database per day. Via the "info db" command, you see how many nodes are currently stored in a database, but there is no cheap solution to find out the number of nodes of an incoming document, because XML documents can be very heterogeneous. Some questions back:
- Do you have some more information on the data you want to store?
- Are all documents similar, or do they vary greatly? If the documents are somewhat similar, you can usually estimate the number of nodes by looking at the byte size.
- Do you know that you will really need to store lots of terabytes of XML data, or is it more like a theoretical assumption?
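(The byte-size estimate mentioned above could look like this in practice. A sketch under assumptions: the nodes-per-byte ratio would be measured on a sample database, e.g. the node count reported by "info db" divided by the database's input size; the numbers below are made up for illustration.)

```python
# Sketch: estimate the node count of an incoming document from its byte
# size, using a ratio measured on a sample database. All figures here
# are invented for illustration.

def nodes_per_byte(sample_nodes, sample_bytes):
    """Ratio measured once on a representative sample database."""
    return sample_nodes / sample_bytes

def estimate_nodes(doc_bytes, ratio):
    """Rough node-count estimate for a document of the given byte size."""
    return int(doc_bytes * ratio)

# e.g. a 60 MB sample containing 1.2 million nodes -> 0.02 nodes per byte,
# so a 3 MB incoming document would hold roughly 60,000 nodes.
ratio = nodes_per_byte(1_200_000, 60_000_000)
estimate = estimate_nodes(3_000_000, ratio)
```

As noted above, this only works if the documents are reasonably homogeneous; heterogeneous documents can have very different node densities per byte.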
Christian
[1] http://docs.basex.org/wiki/Database
[2] http://docs.basex.org/wiki/Table_of_Contents
[3] http://docs.basex.org/wiki/Xquery
--
- Mansi