I have a corpus of TEI files with figures and page images encoded as external entities. It appears that even when choosing “Parse DTDs and entities”, this information is lost when parsing files into the database, and in any case, unparsed-entity-uri() is an XSLT-only function.
It would appear that I need to transform the files first, replacing @entity attributes with @url attributes while the unparsed entity values are still available, before creating the database, or else generate another database later that maps entity names to values.
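For instance, given some $entities map from entity names to URIs (however it gets built), I imagine the rewrite itself could look roughly like this in XQuery, using BaseX’s update expression:

let $doc := doc('legacy/uvaBook/tei/PoeScap.xml')
return $doc update {
  for $att in .//@entity
  return replace node $att with attribute url { $entities(string($att)) }
}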
Are there any better ways to handle this case?
Is there any way to do these transforms on the fly, before parsing the files into the database?
The only thing that comes to mind is to set up a local SaxonServlet to do the transforms, and load from URLs instead of file paths. (I’ve been doing something similar for a different case, and running into memory errors that I don’t see when loading from a directory while creating a database. Increasing the memory didn’t help much, but inserting a ‘flush’ command between the ‘add’ commands seemed to work.)
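In command terms, that workaround boils down to something like the following (the database name and the second file path here are just placeholders):

CREATE DB tei
ADD legacy/uvaBook/tei/PoeScap.xml
FLUSH
ADD legacy/uvaBook/tei/AnotherDoc.xml
FLUSH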
— Steve Majewski
OK: one answer to my own question is that, instead of trying to tackle it by resolving entities when creating the database, I can use the base-uri() of the (in-database) document to find the original, and then parse it as text using file:read-text-lines().
for $LINE in file:read-text-lines('legacy/uvaBook/tei/PoeScap.xml')
where starts-with($LINE, "<!ENTITY") and contains($LINE, "SYSTEM")
let $TOK := tokenize($LINE)
where $TOK[2] != "%"
return element ENTITY {
  attribute ID { $TOK[2] },
  translate($TOK[4], '"', '')
}
Gives me something like:
<ENTITY ID="PoeAltit">uva-lib:488578</ENTITY>
<ENTITY ID="PoeAlcov">uva-lib:488579</ENTITY>
<ENTITY ID="PoeAlspi">uva-lib:488580</ENTITY>
That I could feed to a lookup function.
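For example, if those ENTITY elements were saved to a database — I’ll call it 'entities' here, just as a placeholder — the lookup function could be as simple as:

declare function local:entity-uri($id as xs:string) as xs:string? {
  db:open('entities')//ENTITY[@ID = $id]/string()
};

local:entity-uri('PoeAltit')  (: 'uva-lib:488578', per the sample above :)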
Or maybe a MAP would be more direct:
map:merge(
  for $LINE in file:read-text-lines('legacy/uvaBook/tei/PoeScap.xml')
  where starts-with($LINE, "<!ENTITY") and contains($LINE, "SYSTEM")
  let $TOK := tokenize($LINE)
  where $TOK[2] != "%"
  return map:entry($TOK[2], translate($TOK[4], '"', ''))
)
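With the result bound to a variable, the lookup would then be just a function call — e.g. $entities('PoeAltit') should give back 'uva-lib:488578', going by the sample output above.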
Perhaps there is a way to cache these maps (or precompute them) to avoid having to read and parse the text again?
— Steve.
Hi Steve,
As you have already mentioned, XQuery offers no means to directly access the DTD rules and properties of a document. Furthermore, there are no extension functions in BaseX (at least for now) to remember parsed DTD information.
So in fact you’ll need to extract the information from the original document, as you already did. If you access the contents repeatedly, you could store the entities for all your documents in an entities database…
let $entities := map:merge(
  let $root := file:base-dir() || 'docs/'
  for $path in trace(file:list($root, true(), '*.xml,*.dtd'))
  for $line in file:read-text-lines($root || $path)
  let $result := analyze-string($line,
    ``[^\s*<!ENTITY\s+(?:%\s+)?([-._\p{L}]+)\s+(['"])(.*)\2]``)
  let $name := $result//fn:group[@nr = 1]
  where $name
  (: additional support for element values :)
  let $value := parse-xml-fragment($result//fn:group[@nr = 3]/text())
  return map:entry($name, $value)
)
(: map is not really needed if contents end up in database :)
let $xml := <entities>{
  map:for-each($entities, function($name, $value) {
    <entity name='{ $name }'>{ $value }</entity>
  })
}</entities>
return $xml
(: db:create('entities', $xml, 'entities.xml') :)
…and access the contents in a second step:
let $name := 'dm.prop.nilled'
return db:open('entities')//entity[@name = $name]/*
This surely requires that you have control over your input files (it won’t catch entity definitions that stretch over multiple lines; it assumes entity definitions are unique across all documents; etc.).
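If multi-line declarations do occur in your data, one possible variation (only a sketch) would be to apply a similar pattern to the full file contents instead of single lines, with the 's' flag so that '.' also matches line breaks:

let $text := file:read-text($root || $path)
for $match in analyze-string($text,
  ``[<!ENTITY\s+(?:%\s+)?([-._\p{L}]+)\s+(['"])(.*?)\2]``, 's'
)/fn:match
return map:entry($match/fn:group[@nr = 1]/string(),
                 $match/fn:group[@nr = 3]/string())

($root and $path as in the query above; the entries would be merged the same way.)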
Hope this helps,
Christian