Hi Fabrice and list, I am dealing with data-centric XML rather than documents, so there is a fairly high node-to-content ratio. I have about 250 million nodes, and I find that about 15 million nodes per database seems to work well, but this is just a guesstimate. I am really looking for performance profiles or heuristics so that I can limit the number of nodes in each database before performance degrades. Cheers, Peter
---- Original Message ---- From: fetanchaud@questel.com To: pw@themail.co.uk, fetanchaud@questel.com, BaseX-Talk@mailman.uni-konstanz.de Subject: RE: [basex-talk] handling large files: is there a streaming solution? Date: Tue, 12 Feb 2013 09:07:40 +0000
Dear Peter,
I'm just a BaseX user, and Christian's team will correct me, but in my experience, document size does not matter, at least for querying.
Why do you talk about distributing data? Did you reach the 2-billion-node limit?
As BaseX indexes all nodes, and depending on the value distribution, creating a new collection containing hand-made indices can speed up your queries.
For example, for append-only collections, I usually create an index collection like this:

<index>
  <item value='value to be indexed'>the 'pre' pointer to the indexed element</item>
  <item>...</item>
</index>
And access that 'index' with something like this:
for $i in //item[@value = 'searched value']
return db:open-pre('mydb', xs:integer($i))
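For completeness, a minimal sketch of how such a hand-made index database might be built in BaseX XQuery. The database names ('mydb', 'mydb-index'), the element name 'record', and the @key attribute are all placeholders, and the stored pre values stay valid only as long as 'mydb' is not updated (hence the append-only restriction above):

```xquery
(: Sketch: build a hand-made index database for an append-only collection.
   All names here are placeholders; pre values become stale if 'mydb'
   is updated after the index is built. :)
let $index :=
  <index>{
    for $e in db:open('mydb')//record
    return <item value="{ $e/@key }">{ db:node-pre($e) }</item>
  }</index>
return db:create('mydb-index', $index, 'index.xml')
```

The lookup query above then resolves each stored pre value back to a node with db:open-pre.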
A large number of documents may also slow down the Properties window display in the GUI, because of the document tree view.
Question to the BaseX team: would 'user-defined' indices be an interesting feature?
Regards
-----Original Message----- From: pw@themail.co.uk [mailto:pw@themail.co.uk] Sent: Monday, 11 February 2013 17:13 To: Fabrice Etanchaud; pw@themail.co.uk; BaseX-Talk@mailman.uni-konstanz.de Subject: RE: [basex-talk] handling large files: is there a streaming solution?
Thanks Fabrice, I am making good progress following your advice. Do you have any heuristics for the best way to distribute data for performant searches and subsetting of data? Am I better off having lots of small files or a few large files in a collection?
---- Original Message ---- From: fetanchaud@questel.com To: pw@themail.co.uk, BaseX-Talk@mailman.uni-konstanz.de Subject: RE: [basex-talk] handling large files: is there a streaming solution? Date: Mon, 11 Feb 2013 14:38:54 +0000
Dear Peter,
Did you try to create a collection with the files (CREATE command)?
You should start that way; I don't see the point in using the file: module for import. I think that once the data is in the database, file size does not matter (until you reach millions of files in the collection and do a lot of document-related operations: list, etc.).
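For reference, creating such a collection from the BaseX console might look like this (the database name and path are placeholders):

```
CREATE DB mydb /path/to/xml-files/
INFO DB
```

CREATE DB parses every file under the given path into a single database; INFO DB then shows its size and node count.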
-----Original Message----- From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of pw@themail.co.uk Sent: Monday, 11 February 2013 15:33 To: BaseX-Talk@mailman.uni-konstanz.de Subject: [basex-talk] handling large files: is there a streaming solution?
Hello List, I want to do a join with some large (300-400 MB) XML files and would appreciate guidance on the optimal strategy. At present these files are on the filesystem and not in a database.
Is there any equivalent to Zorba's streaming xml:parse()?
Would loading the files into a database directly be the approach, or is it better to split them into smaller files?
Is the file: module a suitable route through which to import the files?
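For what it's worth, documents can also be added from XQuery without the file: module, using the Database Module; a minimal sketch, where the database name, file path, and target path are placeholders:

```xquery
(: Add a document to an existing database from XQuery.
   'mydb' and the paths are placeholders. :)
db:add('mydb', doc('/data/large-file.xml'), 'large-file.xml')
```

The database must already exist; db:add parses the document and stores it under the given target path.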
Thanks for your help
Peter
_______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk