Hi all,
I'm currently facing the issue of exporting a large table stored in a MySQL instance to a CSV file that then has to be zipped and served through a web service.
I'd love to use XQuery and the BaseX functionality for this because it feels perfectly natural.
The problem is that the XQuery script that performs the SQL query and the transformation to CSV crashes with an out-of-memory exception.
I know how to increase the memory settings and so on, but I'd like to find a stable solution that depends as little as possible on the system's resources.
Currently my only idea is to page the operation and (maybe exploiting parallel execution) build intermediate results in files on disk that are assembled before delivery.
Isn't there a more elegant solution based on the streaming functionality or something similar?
Thanks for any advice you can provide.
Regards,
Marco.
Hi Marco,
Generally, the BaseX SQL Module is capable of streaming results. It is usually up to the corresponding SQL Java driver how large result sets are handled. Maybe you can provide us with the Java stack trace; it could help us find the culprit.
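For what it's worth, here is a minimal sketch of how driver-side streaming can be requested on the MySQL side; the useCursorFetch and defaultFetchSize connection properties are Connector/J-specific assumptions (as are the host, database and credentials), and whether they help in your setup depends on the driver version:

  (: Sketch: ask Connector/J to fetch rows in batches instead of caching the
     whole result set on the client; URL, user and password are placeholders. :)
  let $connection := sql:connect(
    "jdbc:mysql://localhost:3306/mydb?useCursorFetch=true&amp;defaultFetchSize=1000",
    "user", "password")
  let $rows := sql:execute($connection, "SELECT * FROM big_table")
  return ($rows, sql:close($connection))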
Cheers, Christian
Marco, I've faced the same problem and frustration, and I agree that it would be great if BaseX could be used for such tasks, strengthening its potential role as a data integration engine. Kind regards, Hans
…more input would be welcome ;)
Hi Christian, as usual thanks for the attention. I'm sending you the query [1], the transformation function [2] and the stack trace [3]. The database is MySQL 5.7.13 with the JDBC driver mysql-connector-java-5.1.39-bin.jar. Everything is deployed on a single host running Windows 7 32-bit. Thanks again! Marco.
[1] The query:

  declare function j:export($table as xs:string) {
    let $connection := sql:connect(conf:get("jdbc.url"), conf:get("jdbc.user"), conf:get("jdbc.pass"))
    let $results := sql:execute($connection, "SELECT * FROM " || $table)
    let $close := sql:close($connection)
    return $results
  };
[2] The transformation function:

  declare function ioi:to-cvs($sql-results as node()+) {
    <csv>{
      $sql-results ! <record>
        <id>{ ./sql:column[lower-case(@name) = "id"]/string() }</id>
        <name>{ ./sql:column[lower-case(@name) = "name"]/string() }</name>
        <timestamps>{ ./sql:column[lower-case(@name) = "timestamp"]/string() }</timestamps>
        <value>{ ./sql:column[lower-case(@name) = "value"]/string() }</value>
      </record>
    }</csv>
  };
[3] The exception:

  java.lang.OutOfMemoryError: Java heap space
    at java.lang.Long.toString(Unknown Source)
    at java.lang.Long.toString(Unknown Source)
    at java.math.BigInteger.smallToString(Unknown Source)
    at java.math.BigInteger.toString(Unknown Source)
    at java.math.BigInteger.toString(Unknown Source)
    at org.basex.query.func.sql.SqlExecute$1.next(SqlExecute.java:94)
    at org.basex.query.iter.Iter.value(Iter.java:58)
    at org.basex.query.expr.ParseExpr.value(ParseExpr.java:68)
    at org.basex.query.QueryContext.value(QueryContext.java:410)
    at org.basex.query.expr.gflwor.Let$LetEval.next(Let.java:183)
    at org.basex.query.expr.gflwor.GFLWOR$1.next(GFLWOR.java:95)
    at org.basex.query.expr.IterMap$1.next(IterMap.java:48)
    at org.basex.query.iter.Iter.value(Iter.java:51)
    at org.basex.query.expr.ParseExpr.value(ParseExpr.java:68)
    at org.basex.query.QueryContext.value(QueryContext.java:410)
    at org.basex.query.expr.gflwor.Let$LetEval.next(Let.java:183)
    at org.basex.query.expr.gflwor.GFLWOR.value(GFLWOR.java:71)
    at org.basex.query.QueryContext.value(QueryContext.java:410)
    at org.basex.query.func.StaticFunc.invValue(StaticFunc.java:178)
    at org.basex.query.func.FuncCall.invoke(FuncCall.java:87)
    at org.basex.query.func.FuncCall.value(FuncCall.java:61)
    at org.basex.query.QueryContext.value(QueryContext.java:410)
    at org.basex.query.expr.gflwor.Let$LetEval.next(Let.java:183)
    at org.basex.query.expr.gflwor.GFLWOR$1.next(GFLWOR.java:95)
    at org.basex.query.expr.gflwor.GFLWOR$1.next(GFLWOR.java:92)
    at org.basex.query.MainModule$1.next(MainModule.java:121)
    at org.basex.http.restxq.RestXqResponse.create(RestXqResponse.java:55)
    at org.basex.http.restxq.RestXqModule.process(RestXqModule.java:105)
    at org.basex.http.restxq.RestXqFunction.process(RestXqFunction.java:107)
    at org.basex.http.restxq.RestXqServlet.run(RestXqServlet.java:44)
    at org.basex.http.BaseXServlet.service(BaseXServlet.java:65)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
Thanks! This looks like a blocker:
  <csv>{
    $sql-results ! <record>
      <id>{ ./sql:column[lower-case(@name) = "id"]/string() }</id>
      <name>{ ./sql:column[lower-case(@name) = "name"]/string() }</name>
      <timestamps>{ ./sql:column[lower-case(@name) = "timestamp"]/string() }</timestamps>
      <value>{ ./sql:column[lower-case(@name) = "value"]/string() }</value>
    </record>
  }</csv> };
As you are wrapping a root element around all results, everything needs to be cached in main memory, which prevents streaming.
Hi all, back with some results...
We have been able to stream the data from MySQL into a string representation of the expected CSV by rewriting the functions like this:
  declare function j:csv-export($table as xs:string) {
    let $connection := sql:connect(conf:get("jdbc.url"), conf:get("jdbc.user"), conf:get("jdbc.pass"))
    let $tmp-file := string-join((
      '"id","name","timestamp","value"',
      sql:execute($connection, "SELECT * FROM " || $table || $where) ! (out:nl() || j:to-cvs-row(.))
    ))
    return ($tmp-file, sql:close($connection))
  };

  declare function j:to-cvs-row($sql-result as node()) {
    $sql-result/sql:column[lower-case(@name) = "id"]/string() || conf:get("csv.sep") ||
    '"' || $sql-result/sql:column[lower-case(@name) = "name"]/string() || '"' || conf:get("csv.sep") ||
    '"' || $sql-result/sql:column[lower-case(@name) = "timestamp"]/string() || '"' || conf:get("csv.sep") ||
    '"' || $sql-result/sql:column[lower-case(@name) = "value"]/string() || '"'
  };
Now, since we actually have to export three or four tables from the RDBMS into a zip archive, the only remaining bottleneck is that the whole string for each CSV still has to be kept in memory before it is zipped into the archive together with some small metadata.
We use something like the following code:
  let $metadata-entry :=
    <archive:entry last-modified='{current-dateTime()}' compression-level='8'>metadata.json</archive:entry>
  let $table-entries := $tables !
    <archive:entry last-modified='{current-dateTime()}' compression-level='8'>{ . || ".csv" }</archive:entry>
  let $table-data := $tables ! j:csv-export(.)
  return archive:create(($metadata-entry, $table-entries), (serialize($metadata), $table-data))
Are there any suggestions for keeping this very effective "streaming behaviour" while adding the entries to the zip archive as well? Thanks a lot! Marco.
On 11/11/2016 16:11, Christian Grün wrote:
Thanks! This looks like a blocker:
<csv>{ $sql-results ! <record>
<id>{./sql:column[lower-case(@name)="id"]/string()}</id> <name>{./sql:column[lower-case(@name)="name"]/string()}</name> <timestamps>{./sql:column[lower-case(@name)="timestamp"]/string()}</timestamps> <value>{./sql:column[lower-case(@name)="value"]/string()}</value> </record> } </csv> };
As you are wrapping a root element around all results, everything needs to be cached in main memory and thus prevents streaming.
Did you try to use file:write-text() and file:append-text(), and address the result later on?
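A minimal sketch of that idea, assuming a temporary export directory and reusing j:csv-export and $tables from above (the directory handling and entry attributes are illustrative, and the metadata entry is omitted for brevity); this avoids holding all CSV strings at once, although archive:create still materialises each entry's content when the zip is built:

  (: Hypothetical sketch: write each table's CSV to a temporary file first,
     then assemble the archive from those files. Assumes the path returned by
     file:create-temp-dir ends with a directory separator. :)
  let $dir := file:create-temp-dir("csv-export-", "")
  let $write := $tables ! file:write-text($dir || . || ".csv", j:csv-export(.))
  let $entries := $tables !
    <archive:entry last-modified='{current-dateTime()}' compression-level='8'>{ . || ".csv" }</archive:entry>
  let $contents := $tables ! file:read-text($dir || . || ".csv")
  return archive:create($entries, $contents)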