Hi Erol,
BaseX has it's own full-text search engine. However, it can use the stemming classes for different languages provided by Lucene.
Best regards, Dimitar
Am Mittwoch 13 April 2011, 21:51:12 schrieb Erol Akarsu:
I have another question.
Does Basex use Apache Lucene for full text search or have its own search engine?
Thanks
On Mon, Apr 11, 2011 at 12:35 PM, Erol Akarsu eakarsu@gmail.com wrote:
Ok,
I was able to run full text search with another XML Database.
I am primarily interested in how Basex will play with big XML file Wikipedia.
Actually, Database create of wikipedia is fine. But when I add full tet search and indexes for it, it always throws out of memory exception error.
I have changed -Xmx with 6GB that is still not enough to generate indexes for Wikipedia.
Can you help me on how to generate indexes with a machine that case 6GB for Basex process?
Thanks
Erol Akarsu
On Sun, Apr 10, 2011 at 4:07 AM, Andreas Weiler <
andreas.weiler@uni-konstanz.de> wrote:
Hi,
the following query could work for you:
declare namespace w="http://www.mediawiki.org/xml/export-0.5/"; for $i in doc("enwiki-latest-pages-articles")//w:sitename return $i[. contains text "Wikipedia"]/..
-- Andreas
Am 09.04.2011 um 20:43 schrieb Erol Akarsu:
Hi
I am having difficulty in running full text operators. This script gives siteinfo below declare namespace w="http://www.mediawiki.org/xml/export-0.5/";
let $d := fn:doc ("enwiki-latest-pages-articles")
return ($d//w:siteinfo)[1]
But return $d//w:siteinfo[w:sitename contains text 'Wikipedia'] does NOT give same node Why "contains" ft operator behave incorrectly? I remember it was working fine. I just dropped and recreated database and turn all indexes. Can you help me? Query info is here:
Query: declare namespace w="http://www.mediawiki.org/xml/export-0.5/"; Compiling:
- pre-evaluating fn:doc("enwiki-latest-pages-articles")
- adding text() step
- optimizing descendant-or-self step(s)
- removing path with no index results
- pre-evaluating (())[1]
- binding static variable $res
- pre-evaluating fn:doc("enwiki-latest-pages-articles")
- binding static variable $d
- adding text() step
- optimizing descendant-or-self step(s)
- removing path with no index results
- simplifying flwor expression
Result: ()
Timing:
- Parsing: 0.46 ms
- Compiling: 0.42 ms
- Evaluating: 0.17 ms
- Printing: 0.1 ms
- Total Time: 1.15 ms
Query plan:
<sequence size="0"/>
<siteinfo xmlns="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi=" http://www.w3.org/2001/XMLSchema-instance">
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base> <generator>MediaWiki 1.17wmf1</generator> <case>first-letter</case> <namespaces>
<namespace key="-2" case="first-letter">Media</namespace> <namespace key="-1" case="first-letter">Special</namespace> <namespace key="0" case="first-letter"/> <namespace key="1" case="first-letter">Talk</namespace> <namespace key="2" case="first-letter">User</namespace> <namespace key="3" case="first-letter">User talk</namespace> <namespace key="4" case="first-letter">Wikipedia</namespace> <namespace key="5" case="first-letter">Wikipedia talk</namespace> <namespace key="6" case="first-letter">File</namespace> <namespace key="7" case="first-letter">File talk</namespace> <namespace key="8" case="first-letter">MediaWiki</namespace> <namespace key="9" case="first-letter">MediaWiki talk</namespace> <namespace key="10" case="first-letter">Template</namespace> <namespace key="11" case="first-letter">Template talk</namespace> <namespace key="12" case="first-letter">Help</namespace> <namespace key="13" case="first-letter">Help talk</namespace> <namespace key="14" case="first-letter">Category</namespace> <namespace key="15" case="first-letter">Category talk</namespace> <namespace key="100" case="first-letter">Portal</namespace> <namespace key="101" case="first-letter">Portal talk</namespace> <namespace key="108" case="first-letter">Book</namespace> <namespace key="109" case="first-letter">Book talk</namespace>
</namespaces>
</siteinfo>
On Mon, Apr 4, 2011 at 7:31 AM, Erol Akarsu eakarsu@gmail.com wrote:
I imported wikipedia xml into basex and tried to search it.
But searching it takes longer.
I tried to search one element that is first child of whole document and it took 52 sec. I know the XML file is very big 31GB. How can I optimize the search?
declare namespace w="http://www.mediawiki.org/xml/export-0.5/";
let $d := fn:doc ("enwiki-latest-pages-articles")//w:siteinfo return $d
Database info:
open enwiki-latest-pages-articles
Database 'enwiki-latest-pages-articles' opened in 778.49 ms.
info database
Database Properties
Name: enwiki-latest-pages-articles Size: 23356 MB Nodes: 228090153 Height: 6
Database Creation
Path: /mnt/hgfs/C/tmp/enwiki-latest-pages-articles.xml Time Stamp: 03.04.2011 12:29:15 Input Size: 30025 MB Encoding: UTF-8 Documents: 1 Whitespace Chopping: ON Entity Parsing: OFF
Indexes
Up-to-date: true Path Summary: ON Text Index: ON Attribute Index: ON Full-Text Index: OFF
Timing info:
Query: declare namespace w="http://www.mediawiki.org/xml/export-0.5/"; Compiling:
- pre-evaluating fn:doc("enwiki-latest-pages-articles")
- optimizing descendant-or-self step(s)
- binding static variable $d
- removing variable $d
- simplifying flwor expression
Result: element siteinfo { ... }
Timing:
- Parsing: 1.4 ms
- Compiling: 52599.0 ms
- Evaluating: 0.28 ms
- Printing: 0.62 ms
- Total Time: 52601.32 ms
Query plan:
<DBNode name="enwiki-latest-pages-articles" pre="5"/>
Result of query:
<siteinfo xmlns="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi=" http://www.w3.org/2001/XMLSchema-instance">
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base> <generator>MediaWiki 1.17wmf1</generator> <case>first-letter</case> <namespaces>
<namespace key="-2" case="first-letter">Media</namespace> <namespace key="-1" case="first-letter">Special</namespace> <namespace key="0" case="first-letter"/> <namespace key="1" case="first-letter">Talk</namespace> <namespace key="2" case="first-letter">User</namespace> <namespace key="3" case="first-letter">User talk</namespace> <namespace key="4" case="first-letter">Wikipedia</namespace> <namespace key="5" case="first-letter">Wikipedia talk</namespace> <namespace key="6" case="first-letter">File</namespace> <namespace key="7" case="first-letter">File talk</namespace> <namespace key="8" case="first-letter">MediaWiki</namespace> <namespace key="9" case="first-letter">MediaWiki talk</namespace> <namespace key="10" case="first-letter">Template</namespace> <namespace key="11" case="first-letter">Template talk</namespace> <namespace key="12" case="first-letter">Help</namespace> <namespace key="13" case="first-letter">Help talk</namespace> <namespace key="14" case="first-letter">Category</namespace> <namespace key="15" case="first-letter">Category talk</namespace> <namespace key="100" case="first-letter">Portal</namespace> <namespace key="101" case="first-letter">Portal talk</namespace> <namespace key="108" case="first-letter">Book</namespace> <namespace key="109" case="first-letter">Book talk</namespace>
</namespaces>
</siteinfo>
Thanks
Erol Akarsu
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk