Hello,
I have been agonizing over a problem with a service I'm trying to call from RESTXQ. The service (www.tei-c.org/oxgarage/) only accepts multipart/form-data submissions. It provides a front-end client for uploading files from the browser, which calls a back-end RESTful service to do document conversion.
I've installed a local instance of the service (https://github.com/TEIC/oxgarage) and have it running on Tomcat 7, along with BaseX.
The problem arises when I try to submit documents with non-ASCII characters from RESTXQ. Looking at the network traffic, I can see that if the document contains only ASCII characters, the multipart submission body is not base64 encoded. For example:
Encapsulated multipart part: (text/xml)
Content-Disposition: form-data; name='fileToConvert'; filename='homework.xml'\r\n
Content-Type: text/xml\r\n\r\n
eXtensible Markup Language
<html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta/> <title> Test </title> </head> <body> <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi/> <mn> 2 </mn> </msub> </math> </body> </html>
However, if the document does contain non-ASCII characters (such as β), BaseX sets the Content-Transfer-Encoding to "base64." This causes the OxGarage service to fail because it thinks it is receiving an image file rather than a textual document. For example:
Content-Type: text/xml\r\n
Content-Transfer-Encoding: base64\r\n\r\n
eXtensible Markup Language
[truncated] PGh0bWwgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGh0bWwiPjxoZWFkPjxtZXRhLz48\r\ndGl0bGU+VGVzdDwvdGl0bGU+PC9oZWFkPjxib2R5PjxtYXRoIHhtbG5zPSJodHRwOi8vd3d3Lncz\r\nLm9yZy8xOTk4L01hdGgvTWF0aE1MIj48bXN1Yj48bWk+zrI8L21pPjxtbj5Ud288L21
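As a quick check of what the base64 payload above carries, the following minimal Python sketch decodes two short fragments copied from the capture. It uses only the standard library, and the variable names are illustrative.

```python
import base64

# Fragments copied from the captured base64 body above. Each one is a whole
# number of 4-character base64 groups, so it can be decoded on its own.
start_fragment = "PGh0bWwgeG1sbnM9"   # the first 16 characters of the body
beta_fragment = "bWk+zrI8L21p"        # the part surrounding the non-ASCII character

start_text = base64.b64decode(start_fragment).decode("utf-8")
beta_bytes = base64.b64decode(beta_fragment)
beta_text = beta_bytes.decode("utf-8")

print(start_text)   # beginning of the XHTML document
print(beta_bytes)   # raw bytes, including the two-byte UTF-8 form of β
print(beta_text)    # the same bytes read as UTF-8 text
```

The decoded text is ordinary UTF-8 XML, so the document content itself is intact; only the transfer encoding differs from the ASCII case.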
Attached here is a basic test case to replicate the problem: an HTML page with a form and the RESTXQ function that it calls.
I've tried setting a new header to specify Content-Transfer-Encoding as "binary" instead of "base64," but it doesn't replace the default header. Is there any way that the encoding could be controlled from RESTXQ?
Thanks in advance!
Tim
-- Tim A. Thompson Metadata Librarian (Spanish/Portuguese Specialty) Princeton University Library
www.linkedin.com/in/timathompson tat2@princeton.edu
Hi Tim,
Sorry for the late response. I could reproduce the problem, and I see that it makes a difference whether you send pure ASCII or other Unicode characters. I haven’t tracked this down yet, but I’ll give you an update soon.
Cheers, Christian
Hi Tim,
Finally some feedback on this issue.
It turns out that I cannot provide an easy fix for the problem you encountered. Your observations have already summarized the problem, and you have also found out what is happening internally: whenever a multipart body contains non-ASCII data, the "Content-Transfer-Encoding: base64" header is added [1].
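The behaviour described above can be modelled roughly as follows. This is a simplified Python sketch of the rule as it is described in this thread, not the actual BaseX code; the function name `encode_part` and its return shape are hypothetical.

```python
import base64
from typing import Dict, Tuple


def encode_part(payload: str) -> Tuple[Dict[str, str], bytes]:
    """Model of the described rule (hypothetical helper, not BaseX code).

    ASCII-only bodies pass through unchanged; anything else is
    base64-encoded and flagged with a Content-Transfer-Encoding header.
    """
    headers = {"Content-Type": "text/xml"}
    if payload.isascii():
        return headers, payload.encode("ascii")
    headers["Content-Transfer-Encoding"] = "base64"
    return headers, base64.b64encode(payload.encode("utf-8"))


ascii_headers, ascii_body = encode_part("<mi>2</mi>")
beta_headers, beta_body = encode_part("<mi>β</mi>")
print(ascii_headers, ascii_body)
print(beta_headers, beta_body)
```

A single non-ASCII character is enough to switch the whole part to base64, which matches the difference between the two captures in the original report.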
I am now mostly wondering how non-ASCII characters should be transferred, if not encoded as base64. Do you have an idea of what the request would need to look like for the TEI-C service to parse it?
Cheers, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/http/HttpClient.java#L271
Hi, Christian,
Thanks very much for looking into this. If I use the OxGarage TEI web service through the front-end client to upload a file (http://www.tei-c.org/oxgarage/), here is how it sends the request payload on the back end. Non-ASCII characters are replaced with octal escape sequences.
Encapsulated multipart part: (text/xml)
Content-Disposition: form-data; name="fileToConvert"; filename="tei.xml"\r\n
Content-Type: text/xml\r\n\r\n
eXtensible Markup Language
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en"> <teiHeader> <fileDesc> <titleStmt> <title>Multipart test</title> <author/> </titleStmt> <publicationStmt> <p>unknown</p> </publicationStmt> <sourceDesc> <p>unknown</p> </sourceDesc> </fileDesc> </teiHeader> <text> <body> <div type="level1"> <div type="level2"> <p n="4"> <hi rendition="simple:bold"/> </p> <p n="5" rend="Normal"> <hi rend="bold underline"> Regression Equation </hi> </p> <p n="6" rend="Normal"> <math xmlns="http://www.w3.org/1998/Math/MathML"> <mover accent="true"> <mrow> <mi> Y </mi> </mrow> <mo> ^ </mo> </mover> <mo> = </mo> <msub> <mrow> <mi> \316\262 </mi> </mrow> <mrow> <mn> 1 </mn> </mrow> </msub> <mo> + </mo> <msub> <mrow> <mi> \316\262 </mi> </mrow> <mrow> <mn> 2 </mn> </mrow> </msub> <msub> <mrow> <mi> X </mi> </mrow> <mrow> <mn> 2 </mn> </mrow> </msub> <mo> + </mo> <mo> \342\200\246 </mo> <mo> + </mo> <msub> <mrow> <mi> \316\262 </mi> </mrow> <mrow> <mi> i </mi> </mrow> </msub> <msub> <mrow> <mi> X </mi> </mrow> <mrow> <mi> i </mi> </mrow> </msub> </math> </p> </div> </div> </body> </text> </TEI>
Boundary: \r\n-----------------------------10775069631632435281298450283\r\n
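The octal escapes in the capture above look like a byte-by-byte rendering of UTF-8, for example as a packet analyzer might display non-ASCII bytes. That reading is an assumption; the thread does not confirm it. Under that assumption, this short Python sketch turns the escapes back into characters. The helper name `unescape_octal` is hypothetical.

```python
import re


def unescape_octal(text: str) -> str:
    """Replace each three-digit octal escape with its byte, then decode as UTF-8."""
    raw = re.sub(
        rb"\\([0-3][0-7]{2})",                  # backslash plus octal value 000-377
        lambda m: bytes([int(m.group(1), 8)]),  # the single byte that value denotes
        text.encode("ascii"),
    )
    return raw.decode("utf-8")


print(unescape_octal(r"<mi> \316\262 </mi>"))
print(unescape_octal(r"<mo> \342\200\246 </mo>"))
```

Octal 316 and 262 are the bytes 0xCE and 0xB2, the UTF-8 encoding of β, and octal 342 200 246 are the bytes 0xE2 0x80 0xA6, the UTF-8 encoding of the horizontal ellipsis.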
-- Tim A. Thompson Metadata Librarian (Spanish/Portuguese Specialty) Princeton University Library
www.linkedin.com/in/timathompson tat2@princeton.edu
> Non-ASCII characters are replaced with octal escape sequences.
Thanks Tim, this is a great hint.
I feel I will need to dig much deeper into multipart encoding again. Currently, our HTTP client code sends ASCII bodies unchanged. This was implemented under the assumption that input like…
<xml>\101</xml>
...would be adopted unchanged by the server. However, it seems that the OxGarage TEI web service unescapes the body and interprets it as <xml>A</xml>.
I think I’m asking too much, but have you possibly spent more time on this? My quick searches and RFC lookups for octal encoding in multipart form data haven’t been very successful.
Hi,
This might differ from sending XML files, but if you send any other file (image, Word document), there is usually no conversion at all: the plain bytes are sent, and the headers do not even mention an encoding.
From my understanding, it should be the user's responsibility to decide on the transfer encoding. If you do not specify it, there might be some fallback, but currently you are forced to use base64, no matter what headers are already set.
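As an illustration of sending plain bytes, here is a minimal Python sketch that assembles a multipart/form-data body with the UTF-8 bytes of the file included as-is and no Content-Transfer-Encoding header. The field name and file name are taken from the captures earlier in the thread; the function name and the boundary string are assumptions, and whether OxGarage accepts this exact request has not been tested here.

```python
def build_multipart(field: str, filename: str, content: str, boundary: str) -> bytes:
    """Assemble a single-part multipart/form-data body (hypothetical helper)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: text/xml\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    # The file content goes in as raw UTF-8 bytes, with no base64 step.
    return head + content.encode("utf-8") + tail


body = build_multipart(
    "fileToConvert", "homework.xml", "<mi>β</mi>", "example-boundary-1234"
)
print(body)
```

The boundary must also be announced in the request's own Content-Type header (multipart/form-data; boundary=...), and it must not occur inside the file content.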
Br, Max
Christian,
For the sake of comparison, I'm attaching two text files with the HTTP request/response output from my sample query on both the BaseX and eXist RESTXQ implementations. Non-ASCII characters do not seem to be a problem in the eXist implementation. You can see that eXist uses "Transfer-Encoding: chunked" whereas BaseX uses "Content-Transfer-Encoding: base64." But I'm afraid I'm getting out of my depth here!
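The two headers compared above work at different levels: "Transfer-Encoding: chunked" is HTTP/1.1 message framing applied to the whole request body, whereas "Content-Transfer-Encoding" is a MIME header on an individual part. A minimal Python sketch of chunked framing shows that the payload bytes themselves pass through unmodified; the function name and the chunk size are illustrative.

```python
def chunk_body(body: bytes, size: int = 8) -> bytes:
    """Frame a body using HTTP/1.1 chunked transfer coding (illustrative helper)."""
    out = bytearray()
    for start in range(0, len(body), size):
        piece = body[start:start + size]
        # Each chunk: length in hexadecimal, CRLF, the raw bytes, CRLF.
        out += f"{len(piece):x}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"  # zero-length chunk terminates the body
    return bytes(out)


framed = chunk_body("<mi>β</mi>".encode("utf-8"))
print(framed)
```

Chunking only adds length markers around the data, so the UTF-8 bytes of β still appear verbatim on the wire, unlike with base64.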
Thanks again,
Tim
-- Tim A. Thompson Metadata Librarian (Spanish/Portuguese Specialty) Princeton University Library
www.linkedin.com/in/timathompson tat2@princeton.edu
Dear Tim,
No progress so far, I’m sorry, but it was interesting to see the differences in the requests of BaseX and eXist. I was mostly wondering why there are so many parts in the BaseX multipart message, which seem to be completely missing in the eXist request. Are both outputs based on the same XQuery expression? Did you manage to run the same query with both implementations?
Thanks, Christian
Hi, Christian, thanks for the update. As you noticed, the original multipart request had several parts to specify different parameters for the web service. This ran correctly in BaseX, but not in eXist, so I removed them from the eXist example, since they weren't immediately relevant to the encoding issue. In my haste, I forgot to ensure that the two examples were identical :)
Best, Tim
-- Tim A. Thompson Metadata Librarian (Spanish/Portuguese Specialty) Princeton University Library
www.linkedin.com/in/timathompson tat2@princeton.edu
Hi Tim,
I see. Such a pity, so this means we currently have no implementation that successfully manages to upload the data to the TEI page, is this correct? It would be interesting to see whether the upload succeeds with the EXPath implementations from Florent Georges [1]; would you be interested in giving it a try? ;) There is both a generic client and a Saxon implementation available. Up to now, I haven’t tried any of them.
Christian
basex-talk@mailman.uni-konstanz.de