Sorry for not using "Reply All" earlier.
Setting FTINDEXSPLITSIZE to 20000000 let the process get a little further, assuming each dot represents the same amount of work in both runs. With FTINDEXSPLITSIZE at its default:
..............................|..................................................................|..........................................................................|...............................................................................|..
FTINDEXSPLITSIZE = 20000000
.......|.......|........|.......|......|........|.............|.............|.............|.............|.............|.............|.............|.............|..............|.............|.............|.............|.............|.............|.............|............
If it's a matter of making the indexing process take longer, that's not a problem.
Thanks, Chuck
On Tue, Oct 20, 2015 at 1:27 PM, Chuck Bearden cfbearden@gmail.com wrote:
Thanks Christian, I'll try the FTINDEXSPLITSIZE option.
I'm also open to modifying the XML files if that would help. Because of limitations of the service from which we harvest them RESTfully, I have only 20 actual content elements in each file. If you think it would make a difference, I could consolidate them to have, say, 200 or 500 content elements per file, but I have no idea whether that would change how the indexing falls out.
The files also have structures where some properties of each record are each represented by a URL, an ID value, and a string. I could run the files through XSLT to remove all but the string (human-readable is better for our purposes) to make them less verbose.
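For what it's worth, that kind of stripping is a one-template change to an identity transform. Here is a rough sketch; the `property`, `uri`, and `id` element names are hypothetical stand-ins for whatever the real markup uses:

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Drop the URL and ID children, keeping only the readable string -->
  <xsl:template match="property/uri | property/id"/>
</xsl:stylesheet>
```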
BaseX is really super for doing data quality assessments of the XML, and if we could get full-text indexing working, it would speed things up considerably. Thanks to you & your team for all the work you've put into the application!
Alles Gute Chuck Bearden
On Tue, Oct 20, 2015 at 12:55 PM, Christian Grün christian.gruen@gmail.com wrote:
I see; it seems that the index creation is failing at the very final step, in which partial index structures, which are temporarily written to disk, are merged.
You could either increase Xmx even more (to 6 or 7G?), or, if that doesn't work, try assigning different values to the FTINDEXSPLITSIZE option [1] (start e.g. with 20000000).
Sorry for the trouble. Feel free to keep me updated; maybe we'll find a way to fix this, Christian
[1] http://docs.basex.org/wiki/Options#FTINDEXSPLITSIZE
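[Editor's note: a minimal sketch of this suggestion, using standard BaseX console commands and the database/directory names from the command shown later in the thread:]

```
SET FTINDEXSPLITSIZE 20000000
SET FTINDEX true
CREATE DB pure_20151019 pure_20151019
```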
On Tue, Oct 20, 2015 at 7:48 PM, Chuck Bearden cfbearden@gmail.com wrote:
Here's the stack trace:
=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
create db pure_20151019 pure_20151019
Creating Database...
..;..;..;..;..;..;.;..;..;..;..;..;.;..;.;.;.;.....;.....;.....;......;.....;.....;.......;.;.;;.;.;;.;.;;.;.;;.;.;;.;.;;.;.................................................;..........................................................;..........................................................;..........................................................;..........................................................;..........................................................;...................................................
677584.62 ms (1435 MB) Indexing Text...
...........................................................................................................................................................................................................................................................
98215794 operations, 178526.99 ms (1611 MB) Indexing Attribute Values...
...........................................................................................................................................................................................................................................................
178304119 operations, 135613.26 ms (2005 MB) Indexing Full-Text...
..............................|..................................................................|..........................................................................|...............................................................................|..
java.lang.OutOfMemoryError: Java heap space
	at org.basex.index.ft.FTList.next(FTList.java:93)
	at org.basex.index.ft.FTBuilder.merge(FTBuilder.java:236)
	at org.basex.index.ft.FTBuilder.write(FTBuilder.java:140)
	at org.basex.index.ft.FTBuilder.build(FTBuilder.java:85)
	at org.basex.index.ft.FTBuilder.build(FTBuilder.java:23)
	at org.basex.data.DiskData.createIndex(DiskData.java:187)
	at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:103)
	at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:91)
	at org.basex.core.cmd.CreateDB.run(CreateDB.java:104)
	at org.basex.core.Command.run(Command.java:398)
	at org.basex.core.Command.execute(Command.java:100)
	at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
	at org.basex.api.client.Session.execute(Session.java:36)
	at org.basex.core.CLI.execute(CLI.java:103)
	at org.basex.core.CLI.execute(CLI.java:87)
	at org.basex.BaseX.console(BaseX.java:191)
	at org.basex.BaseX.<init>(BaseX.java:166)
	at org.basex.BaseX.main(BaseX.java:42)
org.basex.core.BaseXException: Out of Main Memory. You can try to:
- increase Java's heap size with the flag -Xmx<size>
- deactivate the text and attribute indexes.
	at org.basex.core.Command.execute(Command.java:101)
	at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
	at org.basex.api.client.Session.execute(Session.java:36)
	at org.basex.core.CLI.execute(CLI.java:103)
	at org.basex.core.CLI.execute(CLI.java:87)
	at org.basex.BaseX.console(BaseX.java:191)
	at org.basex.BaseX.<init>(BaseX.java:166)
	at org.basex.BaseX.main(BaseX.java:42)
Out of Main Memory. You can try to:
- increase Java's heap size with the flag -Xmx<size>
- deactivate the text and attribute indexes.
=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
Here's how the process looked in the output of 'ps -ef', in case that's relevant:
=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
cfbeard+ 88769 88757 46 12:15 pts/7 00:00:24 java -cp
/home/cfbearden/opt/basex-8.3.0/BaseX.jar:/home/cfbearden/opt/basex-8.3.0/lib/basex-api-8.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/basex-xqj-1.5.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-codec-1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-fileupload-1.3.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-io-1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/igo-0.4.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/jansi-1.11.jar:/home/cfbearden/opt/basex-8.3.0/lib/javax.servlet-3.0.0.v201112011016.jar:/home/cfbearden/opt/basex-8.3.0/lib/jdom-1.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-continuation-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-http-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-io-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-security-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-server-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-servlet-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-util-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-webapp-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-xml-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jing-20091111.jar:/home/cfbearden/opt/basex-8.3.0/lib/jline-2.13.jar:/home/cfbearden/opt/basex-8.3.0/lib/jts-1.13.jar:/home/cfbearden/opt/basex-8.3.0/lib/lucene-stemmers-3.4.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/milton-api-1.8.1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/mime-util-2.1.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/slf4j-api-1.7.12.jar:/home/cfbearden/opt/basex-8.3.0/lib/slf4j-simple-1.7.12.jar:/home/cfbearden/opt/basex-8.3.0/lib/tagsoup-1.2.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/xmldb-api-1.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/xml-resolver-1.2.jar:/home/cfbearden/opt/basex-8.3.0/lib/xqj2-0.2.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/xqj-api-1.0.jar:
-Xmx4g org.basex.BaseX -d
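[Editor's note: raising the heap as suggested above means changing the -Xmx flag in the basex launcher script, roughly as below; the classpath is abbreviated, and the actual variable name in the script may differ:]

```
java -Xmx7g -cp "$CP" org.basex.BaseX -d
```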
On Tue, Oct 20, 2015 at 12:38 PM, Chuck Bearden cfbearden@gmail.com wrote:
It hasn't failed yet; I've gotten the progress indicators, along with the names of the phases that have completed:
Creating Database... Indexing Text... Indexing Attribute Values...
It's still working on "Indexing Full-Text...". I'll post whatever I get when it fails. Maybe it won't this time :)
Chuck
On Tue, Oct 20, 2015 at 12:33 PM, Christian Grün christian.gruen@gmail.com wrote:
Creating Database... ..;..;..;..;..;..;.;..;..
Do you get any output after this line? (I would expect to see a stack trace, or at least an error message…)
Where 'pure_20151019' is both the name of the database and the subdirectory where all my XML files are.
It could well be that I'm missing a crucial option; I'm still relatively new to BaseX. It's great stuff, though.
Because of my employer's IT environment, I have to run my Linux workstation in a VMware VM, though I doubt that makes a difference.
Thanks, Chuck
On Tue, Oct 20, 2015 at 11:15 AM, Christian Grün christian.gruen@gmail.com wrote:
> Hi Chuck,
>
> Usually, 4G is more than enough to create a full-text index for 16G of
> XML. Obviously, however, that's not the case for your input data. You
> could try to distribute your documents in multiple databases. As an
> alternative, we could have a look at your data and try to find out
> what's going wrong. You can also use the -d flag and send us the stack
> trace.
>
> Best,
> Christian
>
> On Tue, Oct 20, 2015 at 4:19 PM, Chuck Bearden cfbearden@gmail.com wrote:
>> Hi all,
>>
>> I have about 16G of XML data in about 52000 files, and I was hoping to
>> build a full-text index over it. I've tried two approaches: enabling
>> full-text indexing as I create the database and then loading the data,
>> and creating the full-text index after loading the data. If I enable
>> ADDCACHE and modify the basex shell script to use 4g of RAM instead of
>> 512M, I have no problem loading the data. If I try to load with
>> FTINDEX or create the index afterward, the process runs out of memory.
>>
>> I could believe that I'm overlooking some option that would make this
>> possible, but I suspect I just have too much data. I welcome your
>> thoughts & suggestions.
>>
>> All the best,
>> Chuck Bearden