Hello all
I have been reading through many emails on this list and I've learnt a lot about how BaseX works and how others use it. A month or so back I sent an email to this list myself concerning caching. Even though I have some more questions about that, I will leave those for another time. Today I am concerned with retrieving chunked results from BaseX.
(Question also found on StackOverflow, with a nice bounty! :) http://stackoverflow.com/questions/36675388/efficient-and-user-friendly-way-to-present-slow-loading-results)
Case at hand: we use BaseX to query a 50-million-token corpus. We also make this available to other users through a website. The thing is that this is slow. For our own projects that's no problem: we dive straight into the back-end, run a search command from the terminal, and let the query take all the time it needs. However, for users it is paramount that they get a quick response, and at the moment it is taking too long. I don't blame BaseX. We love BaseX and are astounded by its efficiency and optimisations! However, we want to deliver the best user experience to our users.
We call a new session from PHP, wait to receive the results, do some post-processing and then load the result page. As said, this takes too much time. We've been looking into some solutions. The best one, I think, is returning the results in chunks. Do you know those websites that show results only, say, 20 per page? I think something similar is appropriate here. When a user has searched for a pattern, we show only the first 20 or so results, just so they get an idea of what they'd find. Then, when they click a button, we query for the next twenty results, which are appended to the list (a JavaScript solution, I guess), and so on, until all results have been found. Additionally, I will also provide a button from which users can download all results in a text file. That is allowed to take longer. The main thing is that users should get early feedback and results on their query.
Now the question is whether something like this is possible in an efficient manner in BaseX. Can you form a query that finds only the first 20 results, then the following 20, and so on, and is this faster than searching everything at once? In other words, when I am searching for results 120-140 (after having pushed the button a couple of times), is BaseX smart enough to skip the search space it has already covered to find the 120 previous hits? If so, that would be great. Could you help me on my way with some suitable PHP/XQuery code?
I also highly encourage you to participate on StackOverflow. As I said, I am offering a 200 bounty, for those interested in Internet fame. :)
Thank you for your time
Kind regards
Bram Vanroy
Hi Bram,
Thanks for your query. This is just a quick response, but you could have a look at the new %rest:single annotation in BaseX [1]. Your feedback is welcome.
Christian
[1] http://docs.basex.org/wiki/RESTXQ#Query_Execution
On Thu, Apr 21, 2016 at 3:11 PM, Bram Vanroy | KU Leuven bram.vanroy1@student.kuleuven.be wrote:
PS: The current way to return only chunks of the result is to pass on the indexes of the first and last result you want, and to do the filtering in XQuery:
$results[position() = $start to $end]
By returning one more result than requested, the client will know whether there are more results. This may be helpful, because computing the total result size is often much more expensive than returning only the first results.
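To make the plus-one trick concrete, here is a minimal sketch in Python (the PHP side would look similar); `fetch_page` and its parameters are illustrative names, not BaseX API:

```python
def fetch_page(results, start, size):
    """Return one page of results plus a flag saying whether more exist.

    Mirrors $results[position() = $start to $end] with $end set to
    $start + $size, i.e. we ask for one result more than the page
    will actually display."""
    # start is 1-based, like XPath's position()
    window = results[start - 1 : start - 1 + size + 1]
    has_more = len(window) > size
    return window[:size], has_more

# Second page of 20 over a 25-hit result set: 5 hits, no further pages.
page, more = fetch_page(list(range(1, 26)), start=21, size=20)
```

The client then renders `page` and shows the "next" button only while `has_more` is true, so the expensive full count is never needed for paging.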
Does this help? Christian
On Thu, Apr 21, 2016 at 3:18 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Christian
I am coming back to you, replying to your StackOverflow reply.
I have implemented a system where, on page load, the first ten results are shown; when a user clicks a button, an Ajax request is sent and the next ten results are looked up and appended to the HTML. This works great, especially when many hits are found. However, it is still quite slow when you are querying for rare cases. Therefore I am curious about your opinion on the following idea:
What if you set the difference between $start and $end to 1, immediately return each hit, and fire the next Ajax request? This would mean a lot of session opens and closes to BaseX, but it would give a user experience where results 'stream in' as they are found. Is this feasible, or do you strongly advise against it, and if so, why? My assumption is that there are indeed many server queries, but they all happen one at a time, so disk usage should be relatively normal. CPU usage is something else, I guess.
What do you think?
Thanks
-----Original message-----
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Thursday, April 21, 2016 15:22
To: Bram Vanroy | KU Leuven bram.vanroy1@student.kuleuven.be
CC: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Querying Basex from a web interface
Hi Bram,
I have implemented a system where on page load the ten first results are shown, and when a user clicks a button an Ajax request is sent, and the ten next results are looked up and appended to HTML. This works great, especially in cases where many hits are found.
I’m glad to hear it works!
However, this is still quite slow when you are querying for rare cases. Therefore I am curious about your opinion on the following idea:
What if you set the difference between $start and $end to 1, and immediately return each hit, and fire the next ajax request?
Before putting more work into this, I am interested in hearing what exactly the bottleneck in your setup is:
* Have you done some simple performance tests?
* Does it take most time to evaluate the query or to return results to the browser?
* Would a query that returns single results really be faster than one that returns 10 results?
* Do you sort your search results? If yes, returning 100 results instead of 10 should not be much slower.

Thanks in advance
Christian
Hi Christian Hi BaseX folk
I am sorry for the very late reply! As you can imagine I am rather busy. I will try to answer your questions as well as I can below. But before I do that, I think it is important that I provide more information on our set-up. Be warned, this is a very long email!
The supervisor of my project has contacted this mailing list before. We are using BaseX in quite a peculiar manner. To decrease the search space of our 500-million-token corpus, we have introduced a process called GrInding. You can read all about it here [1].
To summarize, as a pre-processing step the corpus is scanned and divided into sections. Each subtree (one parent node and its direct children, recursively) is taken out, and a breadth-first pattern is assigned to it based on properties of that tree. Maybe this is clearer with an example, taken from a thesis that I am writing:
A breadth-first pattern can be seen as a summary of the first level of nodes. From each of these nodes we take the rel attribute and the cat or pt attribute, and join these by a %-sign. The results of all nodes are then sorted alphabetically (to better manage all the data) and concatenated with an underscore separator. Finally, the topmost node’s category is prepended to the whole string. For instance, 1 is an XPath structure based on the dependency structure of de man (the man) which is an NP consisting of a head noun and an article that is a determiner. 2 is its breadth-first representation: the top node’s cat (np) followed by each node’s rel, %, and cat or pt. In other words, det%lid is the first child node, and hd%n represents the second one.
1. /node[@cat="np" and node[@rel="det" and @pt="lid"] and node[@rel="hd" and @pt="n"]]
2. npdet%lid_hd%n
Note that this is done for each tree and all its subtrees, so that all patterns in that one tree can be matched. The breadth-first pattern is then used as a wrapper file that, ironically, includes so-called "includes". These includes are simply file names of patterns that are included in said pattern. As an example, an include of 2 could be npdet%lid_hd%n_mod%adj. That pattern is more specific, and thus should also be found by the more general pattern 2. In addition to includes, these files also contain the subtrees that exactly match the given pattern. The subtrees carry a reference to their original ancestor tree, so that when the XPath input matches a found subtree, we can retrieve the whole sentence from a different database, by ID. I hope this somewhat clarifies the process. As you hopefully gather, the purpose of this technique is to narrow down the search space quickly.
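If it helps to see the pattern construction above as code, here is a small Python sketch (purely illustrative, not the actual GrInding implementation; the input format is my assumption):

```python
def breadth_first_pattern(top_cat, children):
    """Build the breadth-first pattern described above: for each child
    node, join its rel and its cat-or-pt with '%', sort the parts
    alphabetically, concatenate them with '_', and prepend the top
    node's cat."""
    parts = sorted(f"{rel}%{cat_or_pt}" for rel, cat_or_pt in children)
    return top_cat + "_".join(parts)

# "de man": an np whose children are a determiner article (det/lid)
# and a head noun (hd/n); reproduces pattern 2 above.
print(breadth_first_pattern("np", [("det", "lid"), ("hd", "n")]))
# npdet%lid_hd%n
```

Because the parts are sorted, the pattern is independent of child order, which is what makes it usable as a canonical file name.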
The approach explained above also implies that we had to create a lot of BaseX databases. A lot. Around 10 million of them. As a consequence, each search for the same XPath in another pattern (i.e. another include) means another query. From testing, the overhead of this seems minimal, and benchmarks show that the GrInded version of our corpus is indeed queried much faster than the original version.
Now, to reply to your questions:
• Have you done some simple performance tests?
Yes, I have done some simple tests, comparing the search time when using BaseX from the command line and when using BaseX through an online interface. Results are very similar (considering the web interface does some small post-processing steps). This also answers your next question: there is only very little overhead in returning the results to the browser. As mentioned before, the benchmarks do show that the technique with 10 million databases proves to be faster than the other variant.
• Would a query that returns single results really be faster than one that returns 10 results?
Yes. In a search space of 500 million tokens, you can imagine that a rare pattern may take a lot of time to query, even in the GrInded version. In theory it is possible that the very first sentence immediately matches, while the second (and last) hit is only found in the very last sentence of said huge corpus. Therefore, considering user-friendliness, we opted for flushing each individual result as it is found. In the theoretical example I just gave, the user would quickly be presented with a matched sentence (we provide means to "explore" that sentence in a graphical user interface), so at least they have something to do while the next sentence is looked for. If they had to wait for that second and last example, they would wait a very long time before anything actually "happens" on their screen.
• Do you sort your search results? If yes, returning 100 results instead of 10 should not be much slower.
As I am not entirely sure what you mean by that, I don't think we do. By sorting, do you mean the XQuery order by clause?

I hope this answers your questions. I have a couple of my own as well.
First of all, from my benchmarks it became clear that the GrInding was a success. From the results, I could also see some strange behaviour: the GrInded version enjoyed a great caching effect, whereas the non-GrInded benchmark did not. To me this is odd. The GrInded version often needs to go through more databases than the non-GrInded version (depending on the query), but wouldn't that mean that BaseX's cache is cleared more often? I could imagine that the garbage collector passes by after a query, or at least a session, is closed? Have you any idea how this is possible?
Secondly, and lastly, I wonder about something in the online implementation that I built. Currently, searching goes like this. First, Ajax requests are sent to the server, each asking for ONE hit only (always the next one, obviously), using the XQuery position() function and some PHP variable mixing. As a callback to each request, the next request is sent. After a while (after X seconds or X found hits), I initiate a new Ajax request which only counts all hits. This way the user at least gets an idea: currently we've found X of Y total hits. And finally, if all results haven't been found yet, I fire a last request which searches for all results in a single batch. By delaying each step, I provide the user with some feedback along the way.
My two questions are: is count() actually faster than getting all results? Or does count() get all the hits anyway, and should I count and get all results in one step? Secondly, it seems that when the last step is initiated, the other processes hang, leaving the user without any feedback. The processes literally seem to stop running. My question then is: does this happen because BaseX does not handle different sessions asynchronously, and new queries block others? If that is not the problem, I assume I may have an issue with using and modifying locked PHP $_SESSION variables, and I'll have to look into that.

Finally, I simply want to ask what the best flow is for opening and closing BaseX sessions, and when one should open a new session. At the moment, I open a new session on each iteration, and again for counting and for fetching all results. The DRY enthusiast in me does not like this. The thing is that these scripts are run in separate files, through an Ajax request from the main file, so I am not sure how I can "pass a session around". I suppose that storing a BaseX session in a $_SESSION variable won't work in the most desirable way? Or maybe this is not the recommended set-up and I am using BaseX sessions completely wrong? Any insight is greatly appreciated.
I bet that this email was more than you bargained for :-), but I thought it would only be useful if I provided all details to you in full. I hope you find it interesting as well as useful.
Kind regards
Thank you in advance, and thank you for your patience
Bram Vanroy LinkedIn [2]
[1] http://nederbooms.ccl.kuleuven.be/documentation/LREC2014-GrETELSoNaR.pdf [2] https://www.linkedin.com/in/bramvanroy
-----Original message-----
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Monday, April 25, 2016 12:23
To: Bram Vanroy | KU Leuven bram.vanroy1@student.kuleuven.be
CC: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Querying Basex from a web interface
Hi Bram,
I have implemented a system where on page load the ten first results are shown, and when a user clicks a button an Ajax request is sent, and the ten next results are looked up and appended to HTML. This works great, especially in cases where many hits are found.
I’m glad to hear it works!
However, this is still quite slow when you are querying for rare cases. Therefore I am curious about your opinion on the following idea:
What if you set the difference between $start and $end to 1, and immediately return each hit, and fire the next ajax request?
Before spending some more work here, I am interested in hearing what exactly is the bottleneck in your setup:
* Have you done some simple performance tests? * Does it take most time to evaluate the query or to return results to the browser? * Would a query that returns single results really be faster than one that returns 10 results? * Do you sort your search results? If yes, returning 100 results instead of 10 should not be much slower.
Thanks in advance Christian
-----Oorspronkelijk bericht----- Van: Christian Grün [mailto:christian.gruen@gmail.com] Verzonden: donderdag 21 april 2016 15:22 Aan: Bram Vanroy | KU Leuven bram.vanroy1@student.kuleuven.be CC: BaseX basex-talk@mailman.uni-konstanz.de Onderwerp: Re: [basex-talk] Querying Basex from a web interface
PS: The current way to only return chunks of the result is to pass on the index to the first and last result in your result, and to do the filtering in XQuery:
$results[position() = $start to $end]
By returning one more result than requested, the client will know that there will be more results. This may be helpful, because computing the total result size is often much more expensive than returning only the first results.
Does this help? Christian
On Thu, Apr 21, 2016 at 3:18 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Bram,
Thanks for your query. This is just a quick response, but you could have a look at the new %rest:single annotation in BaseX [1]. Your feedback is welcome.
Christian
[1] http://docs.basex.org/wiki/RESTXQ#Query_Execution
On Thu, Apr 21, 2016 at 3:11 PM, Bram Vanroy | KU Leuven bram.vanroy1@student.kuleuven.be wrote:
Hello all
I have been reading through many emails on this list and I’ve learnt a lot about how Basex works and how others use it. A month or so back I have sent an email myself to this list concerning caching. Even though I have some more questions about that, I will leave that for another time. Today I am concerned about retrieving chunked input from Basex.
(Question also found on StackOverflow, with a nice bounty! J http://stackoverflow.com/questions/36675388/efficient-and-user-frien d ly-way-to-present-slow-loading-results)
Case at hand: we use Basex to query a 50 million tokens corpus. We also make this available to other users through a website. The thing is that this is slow. For our own projects that’s no problem, we dive straight into the back-end and run a search command from terminal and let the query run for all the time it needs. However, for users it is paramount that they get a quick response. At the moment it is taking too long. I don’t blame BaseX. We love BaseX and are astounded by its efficiency and optimisations! However, we want to deliver the best user-experience to our users.
We call a new session from PHP, wait to receive the results, do some post-processing and then load the result page. As said, this takes too much time. We’ve been looking into some solutions. The best one that I think should be possible, is returning chunks of the results. Do you know those websites that allow you to see results but only, like, 20 per page? I think something similar is appropriate. When a user has searched for a pattern, we only show the 20 or so first results just so they can get an idea of the results they’d find. Then, when they click a button, we should query for the twenty next results which are then appended to the list (JavaScript solution I guess), and so on. Until all results have been found. Additionally, I will also provide a button from which users can download all results in a text file. This is allowed to take a longer time. The main thing is that users should get early feedback and results on their query.
Now the question is whether something like this is possible in an efficient manner in BaseX. Can you form a query that only finds the first 20 results, then the following 20, and so on? And is this faster than retrieving everything at once? In other words, when I am searching for results 120-140 (after having pushed the button a couple of times), is BaseX smart enough to skip the search space it has already covered to find the 120 previous hits? If so, that would be great. Could you help me on my way with some suitable PHP/XQuery code?
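In XQuery, such paging is typically expressed with a positional filter like (//hit)[position() = 121 to 140] or subsequence(//hit, 121, 20), which BaseX can often evaluate iteratively, stopping after the last requested hit. The general idea can be sketched in plain Python, independent of BaseX, with a lazy generator and itertools.islice; find_hits below is a hypothetical stand-in for the actual corpus query:

```python
from itertools import islice

def find_hits(corpus, pattern):
    """Lazily yield matching tokens; nothing is scanned until consumed.
    (Hypothetical stand-in for the real corpus query.)"""
    for token in corpus:
        if pattern in token:
            yield token

def page(hits, page_no, page_size=20):
    """Return one page of results. islice stops pulling from the
    generator after the last requested hit, so later matches are
    never even inspected."""
    start = page_no * page_size
    return list(islice(hits, start, start + page_size))

# A large corpus, produced lazily so nothing is materialized up front.
corpus = (f"token{i}" for i in range(1_000_000))
first_page = page(find_hits(corpus, "token"), 0)
```

Note the caveat this sketch makes visible: if every button click starts a fresh query, the engine still has to re-scan past the first 120 hits to reach hits 121-140. Keeping an open iterator on the server side, or caching the hit positions from the first run, avoids that repeated work.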
I also highly encourage you to participate on StackOverflow. As I said, I am offering a 200-point bounty, for those who are interested in Internet fame. :)
Thank you for your time
Kind regards
Bram Vanroy
Hi Bram,
Thanks for your reply. It’s long indeed, so sorry in advance if I didn’t capture all relevant info…
The approach explained above also implies that we had to create a lot of BaseX databases. A lot. Around 10 million of them.
Impressive :)
• Would a query that returns single results really be faster than one that returns 10 results?
Yes. In a search space of 500 million tokens, you can imagine that a rare pattern may take a lot of time to query – even in the GrInded version.
I see. So I assume there won’t be many chances to speed up this scenario by working on index structures, as most time is spent for sequentially browsing all the databases, right?
• Do you sort your search results? If yes, returning 100 results instead of 10 should not be much slower.
As I am not entirely sure what you mean by that, I don’t think we do. By sorting, do you mean the XQuery order by clause?
Exactly. I also assume it shouldn’t play a role in your scenario.
wouldn’t that mean that BaseX’s cache is cleared more often? I could imagine that the garbage collector runs after a query, or at least after a session, is closed. Do you have any idea how this works?
Phew, a difficult one… I would need to spend some real time with your framework to give a solid answer.
My two questions are: is count() actually faster than getting all results?
Yes, it will always be faster; but “faster” can mean 1% or 1000%… It will be much faster if the database statistics can be utilized to answer your query (which is probably not the case in your scenario), or if retrieving the data and returning it via the network is what consumes most of the time. If you only count nodes, there is no need to retrieve from disk all the database contents (node properties, textual data) that would otherwise be returned in the XML representation.
Or does count() get all the hits any way, and should I count and get all results in one step?
As you already indicated that the last result may occur much later than the first result in your database(s), I assume you won’t win that much. But for testing, you can wrap your query with count() to see what would be the minimum time to find all hits.
Secondly, it seems that when the last step is initialised, the other processes hang – leaving the user without any feedback. The processes literally seem to stop running. My question then is: does this happen because BaseX does not handle different sessions asynchronously, and new queries block others?
By default, 8 queries can be run in parallel [1]. If your other queries are delayed a lot, it may be that the random disk access patterns caused by parallel queries outweigh the advantage of allowing parallel requests. But I would also assume it’s worth checking your PHP environment first.
Finally, I simply want to ask what the best flow is for opening and closing BaseX sessions, and when one should open a new session.
With the light-weight PHP client, it’s usually best to open a new session for each request, and close it directly after your command or query has been evaluated. As usual, you should ensure that every session will be closed, even if an error occurs.
Hope this helps, Christian
Hi there Christian
Thank you for the extensive answer!
In the meanwhile, I have solved the issue that caused simultaneous queries not to fire asynchronously. The problem was a locked PHP $_SESSION variable, not related to BaseX. My bad!
I am going to test the caching difference again soon, on a larger subcorpus. I'm curious to find out the results! I will incorporate results in a thesis. If you're interested, I can definitely share the results with you when I have finished writing them down!
Finally, I'd like to thank you for providing these answers. I never expected such good feedback and such quick responses. It really means a lot! Thank you!
Kind regards
Bram
I am going to test the caching difference again soon, on a larger subcorpus. I'm curious to find out the results! I will incorporate results in a thesis. If you're interested, I can definitely share the results with you when I have finished writing them down!
Sounds interesting, thanks!