Hi Tim,
I’m sorry that we dropped the text parser feature in BaseX. It didn’t really fit into the existing parser concept, which we want to further streamline in future releases. If enough people are interested, we could re-introduce comparable functionality in a future version.
As of now, you can use file:read-text-lines or fn:unparsed-text-lines:
let $file := 'file.txt'
return db:create(
  'db',
  element text {
    file:read-text-lines($file) ! element line { . }
  },
  $file
)
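If you need to cover a whole directory of text files, the same approach could be extended roughly as follows. This is only a sketch: the directory path, the '*.txt' pattern and the database name are placeholders, and it assumes that one line element per input line is the structure you want for every file:

```xquery
(: placeholder directory; note the trailing slash :)
let $dir := '/path/to/transcripts/'
(: relative paths of all .txt files directly inside $dir :)
let $files := file:list($dir, false(), '*.txt')
return db:create(
  'db',
  (: one text element per file, one line element per input line :)
  for $name in $files
  return element text {
    file:read-text-lines($dir || $name) ! element line { . }
  },
  (: database paths, in the same order as the inputs :)
  $files
)
```

As the input is split into lines, carriage returns should not end up in the stored text.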
The handling of newlines in fn:unparsed-text will possibly be improved in XQuery 4 [1]; feel free to add your suggestions to this issue or others.
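Until then, if you read a file as a single string and want to normalize the line ends yourself, something like the following might do. It is a minimal sketch, assuming that carriage returns (as part of \r\n or on their own) are the only characters to be cleaned up; the file name is a placeholder:

```xquery
let $file := 'file.txt'  (: placeholder file name :)
(: replace \r\n and lone \r with a single newline character :)
let $text := replace(file:read-text($file), '\r\n?', '&#xA;')
return element text { $text }
```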
Hope this helps, Christian
[1] https://github.com/qt4cg/qtspecs/issues/216
On Wed, Aug 9, 2023 at 4:23 AM Tim Thompson <timathom@gmail.com> wrote:
Congrats on the latest version! Looking forward as usual to exploring the new features.
However, I'm perplexed by the decision to remove the text parser from the codebase. I understand the desire to streamline and remove dependencies related to lower-value features, but I've always found the text parser to be super useful. After installing BaseX 10.8 beta today, I had to refactor a process (parsing a set of interview transcripts generated by Zoom) that involved creating a DB from a directory of text files.
In addition, I noticed some unexpected results in how the text was parsed using standard methods. In BaseX 10.6, using the text parser in the GUI, the output looks like this:
<text>WEBVTT
1 00:00:02.910 --> 00:00:27.240 ...
</text>
Here, each line end is just a newline character (\n).
Using file:read-text or fn:unparsed-text (in 10.6 and 10.8 beta), the output looks like this:
<text>WEBVTT
 
 1
 00:00:02.910 --> 00:00:27.240
 ...
</text>
Here, each line end also has a carriage return (\r).
And if I instead store it as an XQuery value, I see the newline characters that aren't otherwise displayed in the GUI:
"WEBVTT

1
00:00:02.910 --> 00:00:27.240
..."
So, the text parser seems to have done some normalization, which was also helpful.
Any chance that it could be restored (by popular demand) in version 11? :)
Best regards, Tim