Hi,
I am trying to implement the algorithm outlined at https://tools.ietf.org/html/rfc3986#section-5.2.4
(Two other implementations are at [1] and [2])
The following function is my work in progress. Everything seems to go fine until the last `else`, which is marked with `(: E :)`. At that point I hit a stack overflow, and I do not see why.
The function may not (yet) make sense, since I am still experimenting.
`local:remove-dot-segments("/a/b/c/./../../g",[])`
```
(:~
 : Remove dot segments from a URI path component according to RFC 3986 Section 5.2.4
 : @param $path the path component of an URL
 : @param $out an empty (!) array
 : @return the path component with the relative part (dot segments) removed
 :         and resolved to the absolute path.
 : @see https://tools.ietf.org/html/rfc3986#section-5.2.4
 :)
declare function local:remove-dot-segments(
  $path as xs:string,
  $out as array(*)
) as item()* {
  concat("in: ", $path, " out: ", string-join(array:flatten($out))),
  if (not(contains($path, "."))) then
    $path
  else if (string-length($path) > 0) then
    (: A :)
    if (substring($path, 1, 3) = "../") then
      local:remove-dot-segments(substring($path, 4), $out)
    else if (substring($path, 1, 2) = "./") then
      local:remove-dot-segments(substring($path, 3), $out)
    (: B :)
    else if (substring($path, 1, 3) = "/./") then
      local:remove-dot-segments(concat("/", substring($path, 4)), $out)
    else if (substring($path, 1, 2) = "/.") then
      local:remove-dot-segments(concat("/", substring($path, 3)), $out)
    (: C :)
    else if (substring($path, 1, 4) = "/../") then
      let $ret := array:remove($out, array:size($out))
      return local:remove-dot-segments(concat("/", substring($path, 5)), $ret)
    else if (substring($path, 1, 3) = "/..") then
      let $ret := array:remove($out, array:size($out))
      return local:remove-dot-segments(concat("/", substring($path, 4)), $ret)
    (: D :)
    else if ($path = ".." or $path = ".") then
      local:remove-dot-segments(concat(substring((), 4), ""), $out)
    (: E :)
    else
      let $ret :=
        if (starts-with($path, "/")) then
          array:append($out, concat("/", substring-before(substring($path, 2), "/")))
        else
          array:append($out, substring-before(substring($path, 1), "/"))
      return local:remove-dot-segments($path, $ret)
  else
    $path
};
```
Sorry for throwing such a big function at you, but I cannot find the place where the stack overflow is triggered. It does not happen if I return an empty sequence at `else (: E :)`. If I replace that part (everything after `else (: E :)`) with `local:remove-dot-segments($path, $out)`, I get the stack overflow as well, which is what confuses me the most.
[1] https://github.com/ariutta/remove-dot-segments/blob/master/index.js
[2] https://gist.github.com/rdlowrey/5f56cc540099de9d5006
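A possible reading of the overflow, offered as a sketch rather than a confirmed diagnosis: branch E calls the function again with `$path` unchanged, so the next call lands in branch E with the same input, and so on without end. That would also explain why a plain `local:remove-dot-segments($path, $out)` in that position overflows. Step E of the RFC moves the first segment out of the input, so the recursive call needs the shortened input. Note also that `substring-before` returns an empty string when no further "/" is present, so the last segment would be lost. The helper name `local:first-segment` below is hypothetical and not part of the original post.

```xquery
xquery version "3.1";

(: Hypothetical helper: returns the first path segment of $path,
   including its leading "/" if there is one. :)
declare function local:first-segment($path as xs:string) as xs:string {
  let $lead := if (starts-with($path, "/")) then "/" else ""
  let $rest := substring($path, string-length($lead) + 1)
  return
    if (contains($rest, "/"))
    then concat($lead, substring-before($rest, "/"))
    else concat($lead, $rest)   (: last segment: take everything that is left :)
};

let $path := "/a/b/c/./../../g"
let $seg  := local:first-segment($path)
return (
  $seg,                                       (: "/a", goes to the output buffer :)
  substring($path, string-length($seg) + 1)   (: "/b/c/./../../g", the new input :)
)
```

With this, each pass through branch E shortens the input by at least one character, so the recursion ends once the input is empty.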
I think, I am finding it...
I replaced the first expression with
`prof:dump(concat("in: ", $path, " out: ", string-join(array:flatten($out)))),`
and this gives me more info. So I may be able to solve this alone.
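For readers without the BaseX profiling module, the standard `fn:trace` function can serve a similar purpose. It writes its argument to the trace output and returns it unchanged, so the debug text stays out of the function result. The function `local:strip-dot-slash` below is only an illustration and does not come from the thread.

```xquery
xquery version "3.1";

(: Illustrative function: strips leading "./" prefixes and traces each input. :)
declare function local:strip-dot-slash($path as xs:string) as xs:string {
  if (starts-with(trace($path, "in: "), "./"))
  then local:strip-dot-slash(substring($path, 3))
  else $path
};

local:strip-dot-slash("././a/b")   (: returns "a/b" :)
```

Where the trace output appears depends on the processor.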
Hi Andreas, I don't know whether I understood your use case correctly, but what about using higher-order functions [1]? Maybe your code could turn into something as simple as:

declare function local:topath($path) {
  let $pathseg := tokenize($path, "/")
  return fold-left($pathseg, (), function($out, $segment) {
    if ($segment = "." or $segment = "") then $out
    else if ($segment = "..") then $out[position() lt count($out)]
    else ($out, $segment)
  })
};

local:topath("/a/b/c/../../../g")
Regards, Marco.
[1] http://docs.basex.org/wiki/Higher-Order_Functions
On 01/04/19 15:49, Andreas Mixich wrote:
I think, I am finding it...
I replaced the first expression with
`prof:dump(concat("in: ", $path, " out: ", string-join(array:flatten($out)))),`
and this gives me more info. So I may be able to solve this alone.
Marco Lettere wrote on 01.04.2019 at 18:01:

declare function local:topath($path) {
  let $pathseg := tokenize($path, "/")
  return fold-left($pathseg, (), function($out, $segment) {
    if ($segment = "." or $segment = "") then $out
    else if ($segment = "..") then $out[position() lt count($out)]
    else ($out, $segment)
  })
};

local:topath("/a/b/c/../../../g")
Beautiful and very close, except for one caveat: the "/" characters are needed to reconstruct the absolute path from the relative one. I played around and tried to add the "/" in your function, but every variant either put a "/" in the wrong place or doubled it.
Last but not least: you asked for the use case, which is described at https://tools.ietf.org/html/rfc3986#section-5.2.4 (note also the sequences shown after the description of the steps).
5.2.4. Remove Dot Segments
The pseudocode also refers to a "remove_dot_segments" routine for interpreting and removing the special "." and ".." complete path segments from a referenced path. This is done after the path is extracted from a reference, whether or not the path was relative, in order to remove any invalid or extraneous dot-segments prior to forming the target URI. Although there are many ways to accomplish this removal process, we describe a simple method using two string buffers.
1. The input buffer is initialized with the now-appended path components and the output buffer is initialized to the empty string.
2. While the input buffer is not empty, loop as follows:
A. If the input buffer begins with a prefix of "../" or "./", then remove that prefix from the input buffer; otherwise,
B. if the input buffer begins with a prefix of "/./" or "/.", where "." is a complete path segment, then replace that prefix with "/" in the input buffer; otherwise,
C. if the input buffer begins with a prefix of "/../" or "/..", where ".." is a complete path segment, then replace that prefix with "/" in the input buffer and remove the last segment and its preceding "/" (if any) from the output buffer; otherwise,
D. if the input buffer consists only of "." or "..", then remove that from the input buffer; otherwise,
E. move the first path segment in the input buffer to the end of the output buffer, including the initial "/" character (if any) and any subsequent characters up to, but not including, the next "/" character or the end of the input buffer.
3. Finally, the output buffer is returned as the result of remove_dot_segments.
Note that dot-segments are intended for use in URI references to express an identifier relative to the hierarchy of names in the base URI. The remove_dot_segments algorithm respects that hierarchy by removing extra dot-segments rather than treat them as an error or leaving them to be misinterpreted by dereference implementations.
The following illustrates how the above steps are applied for two examples of merged paths, showing the state of the two buffers after each step.
STEP   OUTPUT BUFFER         INPUT BUFFER

 1 :                         /a/b/c/./../../g
 2E:   /a                    /b/c/./../../g
 2E:   /a/b                  /c/./../../g
 2E:   /a/b/c                /./../../g
 2B:   /a/b/c                /../../g
 2C:   /a/b                  /../g
 2C:   /a                    /g
 2E:   /a/g

STEP   OUTPUT BUFFER         INPUT BUFFER

 1 :                         mid/content=5/../6
 2E:   mid                   /content=5/../6
 2E:   mid/content=5         /../6
 2C:   mid                   /6
 2E:   mid/6
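The steps quoted above can be sketched in XQuery with two string buffers, as the RFC describes. This is a minimal sketch under stated assumptions, not a tested implementation: the two-string signature and the helper `local:drop-last-segment` are illustrative choices that do not appear in the thread, and the output buffer is assumed to start as the empty string.

```xquery
xquery version "3.1";

(: Hypothetical helper: removes the last segment and its preceding "/"
   (if any) from the output buffer. :)
declare function local:drop-last-segment($out as xs:string) as xs:string {
  let $parts := tokenize($out, "/")
  return string-join($parts[position() lt last()], "/")
};

(: $in is the input buffer, $out the output buffer (pass "" initially). :)
declare function local:remove-dot-segments(
  $in  as xs:string,
  $out as xs:string
) as xs:string {
  if ($in = "") then
    $out   (: step 3: input is empty, return the output buffer :)
  (: A: drop a leading "../" or "./" :)
  else if (starts-with($in, "../")) then
    local:remove-dot-segments(substring($in, 4), $out)
  else if (starts-with($in, "./")) then
    local:remove-dot-segments(substring($in, 3), $out)
  (: B: replace a leading "/./" or a complete "/." with "/" :)
  else if (starts-with($in, "/./")) then
    local:remove-dot-segments(substring($in, 3), $out)
  else if ($in = "/.") then
    local:remove-dot-segments("/", $out)
  (: C: replace a leading "/../" or a complete "/.." with "/"
        and remove the last output segment :)
  else if (starts-with($in, "/../")) then
    local:remove-dot-segments(substring($in, 4), local:drop-last-segment($out))
  else if ($in = "/..") then
    local:remove-dot-segments("/", local:drop-last-segment($out))
  (: D: input is only "." or ".." :)
  else if ($in = (".", "..")) then
    local:remove-dot-segments("", $out)
  (: E: move the first segment, with its leading "/", to the output :)
  else
    let $lead := if (starts-with($in, "/")) then "/" else ""
    let $rest := substring($in, string-length($lead) + 1)
    let $seg  := concat($lead,
                   if (contains($rest, "/")) then substring-before($rest, "/") else $rest)
    return local:remove-dot-segments(
      substring($in, string-length($seg) + 1), concat($out, $seg))
};

(
  local:remove-dot-segments("/a/b/c/./../../g", ""),    (: "/a/g" :)
  local:remove-dot-segments("mid/content=5/../6", "")   (: "mid/6" :)
)
```

Every branch passes a strictly shorter input buffer to the next call, so the recursion ends when the input is empty. The two calls at the end correspond to the two traces shown in the RFC tables.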
If you need to join the resulting strings just use .... well ... string-join ...
declare function local:topath($path) {
  let $pathseg := tokenize($path, "/")
  let $pathsequence := fold-left($pathseg, (), function($out, $segment) {
    if ($segment = "." or $segment = "") then $out
    else if ($segment = "..") then $out[position() lt count($out)]
    else ($out, $segment)
  })
  return string-join($pathsequence, "/")
};

local:topath("/a/b/c/../../../g")
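Because empty segments are skipped, joining with "/" drops the leading slash of an absolute path, so the call above yields "g" rather than "/g". One way to keep it, sketched here as an assumption about the desired behaviour rather than as Marco's intended version, is to remember whether the input started with "/" and put that prefix back:

```xquery
xquery version "3.1";

declare function local:topath($path as xs:string) as xs:string {
  let $segments := fold-left(tokenize($path, "/"), (), function($out, $segment) {
    if ($segment = ("", ".")) then $out                    (: skip empty and "." segments :)
    else if ($segment = "..") then $out[position() lt last()]  (: drop the last kept segment :)
    else ($out, $segment)
  })
  (: restore the leading "/" of an absolute path :)
  let $prefix := if (starts-with($path, "/")) then "/" else ""
  return concat($prefix, string-join($segments, "/"))
};

(
  local:topath("/a/b/c/../../../g"),   (: "/g" :)
  local:topath("mid/content=5/../6")   (: "mid/6" :)
)
```

This variant still differs from the RFC procedure in some edge cases: it collapses repeated slashes and drops a trailing slash, whereas the RFC steps turn "/a/b/." into "/a/b/".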
On 01/04/19 22:02, Andreas Mixich wrote:
basex-talk@mailman.uni-konstanz.de