Hi,
I was looking for this feature in version 7. https://mailman.uni-konstanz.de/pipermail/basex-talk/2011-January/000994.htm... says it was temporarily removed from the GUI but
create fs [name] [path]
gives an error in 7.0.1. Is this feature gone for good now? /Andy
Hi Andy,
On 14.11.2011, at 13:28, Andy Bunce wrote:
[...] Is this feature gone for good now?
That depends on what you want to achieve.
The 'create fs' command walked a given file hierarchy and produced an FSML (Filesystem Markup Language) database. Whenever a known file type was encountered, a file-type-specific extractor took care of it and added information about that file (ID3 tags for MP3 files, full text for PDF files, etc.) to the FSML mapping. This provided the basis for querying the filesystem data later on.
That code has since been separated from BaseX in order to keep the core clean. Moreover, the need for external libraries to extract metadata ran against our wish to keep BaseX as independent as possible.
If you just wish to produce an FSML mapping, XQuery using the EXPath File Module [1] can pretty much do the job:
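For illustration only, the walk-and-extract scheme described above can be sketched in a few lines of Python. This is not the original BaseX code: the EXTRACTORS registry, the extractor decorator, the build_tree function, the <link> element and the line-counting "txt" extractor are all hypothetical stand-ins, and real extractors (ID3, PDF full text) would need external libraries.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical registry: lowercase file suffix -> function returning a dict of facts.
EXTRACTORS = {}


def extractor(suffix):
    """Register the decorated function as the extractor for `suffix`."""
    def register(func):
        EXTRACTORS[suffix] = func
        return func
    return register


@extractor("txt")
def extract_text(path):
    # Stand-in for a real extractor: report the number of lines in the file.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return {"Lines": str(sum(1 for _ in fh))}


def build_tree(path):
    """Return a <dir>, <file> or <link> element describing `path`."""
    name = os.path.basename(os.path.normpath(path))
    if os.path.islink(path):
        # Do not follow symbolic links, to avoid cycles.
        return ET.Element("link", name=name)
    if os.path.isdir(path):
        node = ET.Element("dir", name=name)
        for entry in sorted(os.listdir(path)):
            node.append(build_tree(os.path.join(path, entry)))
        return node
    node = ET.Element("file", name=name, size=str(os.path.getsize(path)))
    suffix = os.path.splitext(name)[1].lstrip(".").lower()
    extract = EXTRACTORS.get(suffix)
    if extract is not None:
        # Known file type: attach whatever the extractor found.
        for key, value in extract(path).items():
            ET.SubElement(node, "fact", name=key).text = value
    return node
```

Calling ET.tostring(build_tree(some_directory), encoding="unicode") would then give the mapping as a string; files without a registered extractor simply get name and size.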
declare namespace fs = "http://basex.org/fs";

declare function fs:parse($path as xs:string) as element() {
  (: strip everything up to the last slash or backslash :)
  let $name := replace($path, ".*[\\/]", "")
  return
    if (file:is-directory($path)) then
      <dir name="{ $name }">{
        (: file:list returns names relative to $path, so rebuild the full path :)
        for $f in file:list($path)
        return fs:parse(concat($path, "/", $f))
      }</dir>
    else
      <file name="{ $name }" size="{ file:size($path) }"/>
};
If you also want to have the extractor functionality ... we thought about packaging [2] it for BaseX and making it available as XQuery functions. Just give us a hint and we will get going.
Thanks, Alex
[1] http://docs.basex.org/wiki/File_Functions
[2] http://docs.basex.org/wiki/Packaging
It is the metadata extraction part that is non-trivial, so packaging the libraries and the calls for that sounds like a great way to go.
/Andy
On Mon, Nov 14, 2011 at 7:22 PM, John D. Mitchell jdmitchell@gmail.com wrote:
On Nov 14, 2011, at 11:17 , Alexander Holupirek wrote: [...]
If you also want to have the extractor functionality ... we thought about packaging [2] it for BaseX and make it available as XQuery functions. Just give us a hint and we will get going.
++
Cheers, John
On 14.11.2011, at 21:48, Andy Bunce wrote:
[...]
Thanks for your feedback. We decided to go for the packaging approach and to provide an EXPath package [0] that produces an FSML database of a given file hierarchy.
It would be interesting to hear which file types are relevant for you. The idea is to have transducer code [1] that, for example, extracts ID3 information for audio files:
<file name="LockerBleiben.mp3" suffix="mp3" st_mode="0100644" st_size="4585915"
      st_mtime="1320945388000" st_uid="1000" st_gid="1000" st_nlink="1"
      bsid="70622d84-f4f7-4b90-95e2-9e1821e8d283">
  <folder name="ID3v2">
    <fact name="Title">Locker Bleiben</fact>
    <fact name="Artist">Die Fantastischen Vier</fact>
    <fact name="Composer">Andreas Rieke/Michael DJ Beck/Thomas Dürr/Michael B. Schmidt</fact>
    <fact name="Album">Lauschgift</fact>
    <fact name="Track">15/20</fact>
    <fact name="PartOfSet">1/1</fact>
    <fact name="Year">1995</fact>
    <fact name="Genre">Hip Hop/Rap</fact>
    <fact name="Compilation">1</fact>
    <fact name="Comment">(iTunPGAP) 0</fact>
    <fact name="EncodedBy">iTunes 8.0.2</fact>
  </folder>
  <folder name="Cover"> ... </folder>
</file>
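As a rough sketch of how an element of this shape might be assembled, assuming the metadata has already been extracted into plain dictionaries: the function name file_element and its folders argument are hypothetical, st_mtime is assumed to be milliseconds since the epoch (which is what the sample value suggests), and the bsid attribute is left out because the thread does not say how it is generated.

```python
import os
import xml.etree.ElementTree as ET


def file_element(path, folders):
    """Build a <file> element for `path`.

    `folders` maps a folder name (e.g. "ID3v2") to a dict of fact name -> value.
    """
    st = os.lstat(path)
    name = os.path.basename(path)
    elem = ET.Element("file", {
        "name": name,
        "suffix": os.path.splitext(name)[1].lstrip("."),
        "st_mode": "0%o" % st.st_mode,             # octal with a leading zero
        "st_size": str(st.st_size),
        "st_mtime": str(int(st.st_mtime * 1000)),  # assumption: milliseconds
        "st_uid": str(st.st_uid),
        "st_gid": str(st.st_gid),
        "st_nlink": str(st.st_nlink),
    })
    for folder_name, facts in folders.items():
        folder = ET.SubElement(elem, "folder", name=folder_name)
        for fact_name, value in facts.items():
            ET.SubElement(folder, "fact", name=fact_name).text = value
    return elem
```

A transducer for a given file type would then only have to return the dictionary of facts; the element construction stays the same for every type.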
I am currently thinking about using exiftool [2] by Phil Harvey to include metadata about numerous multimedia files. We could also extract full text and publisher metadata from PDF files, etc.
If you have something special in mind or want to comment on this, I'm all ears.
Thanks, Alex
[0] EXPath Packaging: http://docs.basex.org/wiki/Packaging
[1] "Transducer" was coined by Gifford et al., Semantic File Systems: http://dl.acm.org/citation.cfm?id=121138
[2] http://www.sno.phy.queensu.ca/~phil/exiftool/
I am mainly interested in images (usually JPG) and audio (usually MP3). I don't know much about ExifTool, but it seems to be a Perl library. Nothing wrong with that :-), but it sounds like a heavy choice to wrap in a Java package?
XML Calabash has a cx:metadata-extractor extension step; for images it is a thin shell around Drew Noakes' library of the same name (http://www.drewnoakes.com/code/exif/). It is mentioned at http://xmlcalabash.com/download/
MP3 is more tricky, but https://github.com/mpatric/mp3agic looks like a possible candidate to me.
/Andy
On Tue, Nov 15, 2011 at 2:48 PM, Alexander Holupirek < alexander.holupirek@uni-konstanz.de> wrote:
[...]
Or Apache Tika covers a lot of ground... http://tika.apache.org/1.0/formats.html#Supported_Document_Formats
On Tue, Nov 15, 2011 at 11:34 PM, Andy Bunce bunce.andy@gmail.com wrote:
[...]
On 11/14/2011 08:17 PM, Alexander Holupirek wrote:
If you also want to have the extractor functionality ... we thought about packaging [2] it for BaseX and make it available as XQuery functions. Just give us a hint and we will get going.
Hi Alex,
BTW: I would maybe provide a switch or something to disable the traversal of hidden files (for instance .svn folders :-)). For my own requirements it seems to be sufficient to ignore hidden files altogether. Hopefully my little changes are sufficient ;-)
dents = os.listdir(dpath)
if depth != 0:
    if not dents:
        visit_empty_directory(dpath, depth)
        return
    if not dents[0].startswith('.'):
        visit_enter_directory(dpath, depth)
for file_ in dents:
    if not file_.startswith('.'):
        path = os.path.join(dpath, file_)
        if os.path.islink(path):
            visit_link(path, depth + 1)
        elif os.path.isdir(path):
            descend(path, depth + 1, xdcr_map)
        else:
            visit_file(path, depth + 1, xdcr_map)
if depth != 0:
    if not dents[0].startswith('.'):
        visit_leave_directory(depth)
kind regards, Johannes
Haven't really been following the conversation so if this is already there ignore me, but...
... or, a regexp pattern of files/directories to ignore/include.
Kevin
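A minimal sketch of what such pattern-based filtering could look like, assuming a plain os.walk traversal rather than the visitor code quoted in this thread. The function name walk_filtered and its exclude/include parameters are hypothetical.

```python
import os
import re


def walk_filtered(root, exclude=None, include=None):
    """Yield file paths below `root`.

    Names matching the `exclude` regex are skipped (directories are not
    descended into); if `include` is given, only matching file names are kept.
    """
    exclude_re = re.compile(exclude) if exclude else None
    include_re = re.compile(include) if include else None
    for dirpath, dirnames, filenames in os.walk(root):
        if exclude_re:
            # Prune in place so os.walk does not descend into excluded directories.
            dirnames[:] = [d for d in dirnames if not exclude_re.search(d)]
        dirnames.sort()
        for name in sorted(filenames):
            if exclude_re and exclude_re.search(name):
                continue
            if include_re and not include_re.search(name):
                continue
            yield os.path.join(dirpath, name)
```

For example, walk_filtered(some_directory, exclude=r"^\.", include=r"\.(mp3|jpg)$") would skip hidden files and folders such as .svn and keep only MP3 and JPG files.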
On Mon, Nov 14, 2011 at 12:03 PM, Johannes.Lichtenberger Johannes.Lichtenberger@uni-konstanz.de wrote:
[...]
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk