New subject: Full-text index with lots of data

20 Oct 2015

Sorry for not using "Reply All" earlier.
Setting FTINDEXSPLITSIZE to 20000000 enabled the process to get a little
further, assuming each dot means the same thing in both runs. Progress
output with FTINDEXSPLITSIZE at its default:
..............................|..................................................................|..........................................................................|...............................................................................|..
Progress output with FTINDEXSPLITSIZE = 20000000:
.......|.......|........|.......|......|........|.............|.............|.............|.............|.............|.............|.............|.............|..............|.............|.............|.............|.............|.............|.............|............
If it's a matter of making the indexing process take longer, that's not a
problem.
Thanks,
Chuck
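
To put numbers on "a little further", the two progress lines above can be compared by counting dots. This is a minimal sketch in Python; it assumes each dot stands for the same amount of work in both runs and treats '|' purely as a segment separator, neither of which the thread confirms. The function name summarize_progress is illustrative.

```python
def summarize_progress(line):
    """Return (total dots, dots per '|'-delimited segment) for a progress line."""
    # Count only '.' characters so trailing text (e.g. an error message
    # appended to the progress line) does not inflate a segment.
    segments = [seg.count(".") for seg in line.strip().split("|")]
    return sum(segments), segments


if __name__ == "__main__":
    # Tiny synthetic line, not the real output from the thread.
    print(summarize_progress("..|...|."))  # (6, [2, 3, 1])
```

Pasting each of the two real progress lines into summarize_progress gives totals that can be compared directly.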
On Tue, Oct 20, 2015 at 1:27 PM, Chuck Bearden cfbearden@gmail.com wrote:
Thanks Christian, I'll try the FTINDEXSPLITSIZE option.
I'm also open to modifying the XML files if that would help. Because
of limitations of the service from which we harvest them RESTfully, I
have only 20 actual content elements in each file. If you think it
would make a difference, I could consolidate them to have, say, 200 or
500 of the actual content elements per file, but I have no idea if
that would change how the indexing falls out.
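
The consolidation described above could be sketched roughly as follows, in Python. It assumes, hypothetically, that every content element is a direct child of the document root; the function name consolidate, the root_tag default and the batch_NNNNN.xml file names are illustrative and do not come from the thread. Whether larger files change how the indexing behaves is not established here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def consolidate(src_files, out_dir, per_file=200, root_tag="records"):
    """Regroup the root's direct children into files of up to per_file elements.

    Assumption: each content element is a direct child of the root element.
    Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    batch = []

    def flush():
        # Write the current batch to a new file, then start a fresh batch.
        if not batch:
            return
        root = ET.Element(root_tag)
        root.extend(batch)
        target = out_dir / f"batch_{len(written):05d}.xml"
        ET.ElementTree(root).write(str(target), encoding="utf-8",
                                   xml_declaration=True)
        written.append(target)
        batch.clear()

    for path in src_files:
        for child in ET.parse(str(path)).getroot():
            batch.append(child)
            if len(batch) >= per_file:
                flush()
    flush()  # write whatever is left over
    return written
```

Called as consolidate(sorted(Path("pure_20151019").glob("*.xml")), "consolidated", per_file=500), this would write the regrouped files to a new directory and leave the harvested originals untouched.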
The files also have structures where some properties of each record
are each represented by a URL, an ID value, and a string. I could
XSLT the files to remove all but the string (human readable is better
for our purposes) to make them less verbose.
BaseX is really super for doing data quality assessments of the XML,
and if we could get full-text indexing working, it would speed things
up considerably. Thanks to you & your team for all the work you've put
in to the application!
Alles Gute
Chuck Bearden
On Tue, Oct 20, 2015 at 12:55 PM, Christian Grün
christian.gruen@gmail.com wrote:
I see; it seems that the index creation is failing at the very final
step, in which partial index structures, which are temporarily written
to disk, are merged.
You could either increase Xmx even more (to 6 or 7G?) or, if this
doesn't work, try assigning different values to the
FTINDEXSPLITSIZE option [1] (starting e.g. with 20000000).
Sorry for the trouble. Feel free to keep me updated; maybe we'll find a
way to fix this.
Christian
[1] http://docs.basex.org/wiki/Options#FTINDEXSPLITSIZE
On Tue, Oct 20, 2015 at 7:48 PM, Chuck Bearden cfbearden@gmail.com
wrote:
Here's the stack trace:
=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
create db pure_20151019 pure_20151019
Creating Database...
..;..;..;..;..;..;.;..;..;..;..;..;.;..;.;.;.;.....;.....;.....;......;.....;.....;.......;.;.;;.;.;;.;.;;.;.;;.;.;;.;.;;.;.................................................;..........................................................;..........................................................;..........................................................;..........................................................;..........................................................;...................................................
677584.62 ms (1435 MB)
Indexing Text...
...........................................................................................................................................................................................................................................................
98215794 operations, 178526.99 ms (1611 MB)
Indexing Attribute Values...
...........................................................................................................................................................................................................................................................
178304119 operations, 135613.26 ms (2005 MB)
Indexing Full-Text...
..............................|..................................................................|..........................................................................|...............................................................................|..java.lang.OutOfMemoryError:
Java heap space
    at org.basex.index.ft.FTList.next(FTList.java:93)
    at org.basex.index.ft.FTBuilder.merge(FTBuilder.java:236)
    at org.basex.index.ft.FTBuilder.write(FTBuilder.java:140)
    at org.basex.index.ft.FTBuilder.build(FTBuilder.java:85)
    at org.basex.index.ft.FTBuilder.build(FTBuilder.java:23)
    at org.basex.data.DiskData.createIndex(DiskData.java:187)
    at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:103)
    at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:91)
    at org.basex.core.cmd.CreateDB.run(CreateDB.java:104)
    at org.basex.core.Command.run(Command.java:398)
    at org.basex.core.Command.execute(Command.java:100)
    at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
    at org.basex.api.client.Session.execute(Session.java:36)
    at org.basex.core.CLI.execute(CLI.java:103)
    at org.basex.core.CLI.execute(CLI.java:87)
    at org.basex.BaseX.console(BaseX.java:191)
    at org.basex.BaseX.<init>(BaseX.java:166)
    at org.basex.BaseX.main(BaseX.java:42)
org.basex.core.BaseXException: Out of Main Memory.
You can try to:
- increase Java's heap size with the flag -Xmx<size>
- deactivate the text and attribute indexes.
  at org.basex.core.Command.execute(Command.java:101)
  at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
  at org.basex.api.client.Session.execute(Session.java:36)
  at org.basex.core.CLI.execute(CLI.java:103)
  at org.basex.core.CLI.execute(CLI.java:87)
  at org.basex.BaseX.console(BaseX.java:191)
  at org.basex.BaseX.<init>(BaseX.java:166)
  at org.basex.BaseX.main(BaseX.java:42)

Out of Main Memory.
You can try to:
- increase Java's heap size with the flag -Xmx<size>
- deactivate the text and attribute indexes.

=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
Here's how the process looked in the output of 'ps -ef', in case
that's relevant:
=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
cfbeard+  88769  88757 46 12:15 pts/7    00:00:24 java -cp
/home/cfbearden/opt/basex-8.3.0/BaseX.jar:/home/cfbearden/opt/basex-8.3.0/lib/basex-api-8.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/basex-xqj-1.5.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-codec-1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-fileupload-1.3.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-io-1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/igo-0.4.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/jansi-1.11.jar:/home/cfbearden/opt/basex-8.3.0/lib/javax.servlet-3.0.0.v201112011016.jar:/home/cfbearden/opt/basex-8.3.0/lib/jdom-1.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-continuation-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-http-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-io-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-security-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-server-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-servlet-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-util-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-webapp-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-xml-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jing-20091111.jar:/home/cfbearden/opt/basex-8.3.0/lib/jline-2.13.jar:/home/cfbearden/opt/basex-8.3.0/lib/jts-1.13.jar:/home/cfbearden/opt/basex-8.3.0/lib/lucene-stemmers-3.4.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/milton-api-1.8.1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/mime-util-2.1.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/slf4j-api-1.7.12.jar:/home/cfbearden/opt/basex-8.3.0/lib/slf4j-simple-1.7.12.jar:/home/cfbearden/opt/basex-8.3.0/lib/tagsoup-1.2.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/xmldb-api-1.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/xml-resolver-1.2.jar:/home/cfbearden/opt/basex-8.3.0/lib/xqj2-0.2.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/xqj-api-1.0.jar:
-Xmx4g org.basex.BaseX -d
On Tue, Oct 20, 2015 at 12:38 PM, Chuck Bearden cfbearden@gmail.com
wrote:
It hasn't failed yet; I've gotten the progress indicators, along with
the phases that have been completed:
Creating Database...
Indexing Text...
Indexing Attribute Values...
It's still working on "Indexing Full-Text...". I'll post whatever I
get when it fails. Maybe it won't this time :)
Chuck
On Tue, Oct 20, 2015 at 12:33 PM, Christian Grün
christian.gruen@gmail.com wrote:
...
...
Creating Database...
..;..;..;..;..;..;.;..;..
Do you get any output after this line? (I would have expected to see a
stack trace, or at least an error message…)
...
Where 'pure_20151019' is both the name of the database and the
subdirectory where all my XML files are.
It could well be that I'm missing a crucial option; I'm still
relatively new to BaseX. It's great stuff, though.
Because of my employer's IT environment, I have to run my Linux
workstation in a VMWare VM, though I doubt that that makes a
difference.
Thanks,
Chuck
On Tue, Oct 20, 2015 at 11:15 AM, Christian Grün
christian.gruen@gmail.com wrote:
> Hi Chuck,
>
> Usually, 4G is more than enough to create a full-text index for 16G of
> XML. Obviously, however, that's not the case for your input data. You
> could try to distribute your documents across multiple databases. As an
> alternative, we could have a look at your data and try to find out
> what's going wrong. You can also use the -d flag and send us the stack
> trace.
>
> Best,
> Christian
>
>
> On Tue, Oct 20, 2015 at 4:19 PM, Chuck Bearden cfbearden@gmail.com wrote:
>> Hi all,
>>
>> I have about 16G of XML data in about 52000 files, and I was hoping to
>> build a full-text index over it. I've tried two approaches: enabling
>> full-text indexing as I create the database and then loading the data,
>> and creating the full-text index after loading the data. If I enable
>> ADDCACHE and modify the basex shell script to use 4g of RAM instead of
>> 512M, I have no problem loading the data. If I try to load with
>> FTINDEX or create the index afterward, the process runs out of memory.
>>
>> I could believe that I'm overlooking some option that would make this
>> possible, but I suspect I just have too much data. I welcome your
>> thoughts & suggestions.
>>
>> All the best,
>> Chuck Bearden

Re: [basex-talk] Full-text index with lots of data