Whatever happened to DeepFS


Andy Bunce

14 Nov 2011

7:28 a.m.

Hi,

I was looking for this feature in version 7. https://mailman.uni-konstanz.de/pipermail/basex-talk/2011-January/000994.htm... says it was temporarily removed from the GUI, but

create fs [name] [path]

gives an error in 7.0.1. Is this feature gone for good now?

/Andy


Alexander Holupirek

14 Nov 2011

2:17 p.m.

Hi Andy,


It depends on what you want to achieve.

The 'create fs' command walked a given file hierarchy and produced an FSML (Filesystem Markup Language) database. Whenever a 'known file type' was encountered, a file-type-specific extractor took care of it and added information about that file (ID3 tags for mp3 files, full text for pdf files, etc.) to the FSML mapping. This provided the basis for 'querying' the filesystem data later on.

That code has been separated from BaseX in order to keep the core clean. Moreover, the need for some external libraries to extract metadata ran against our wish to keep BaseX as independent as possible.
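The walk-and-extract design described above can be sketched in a few lines of Python. This is only an illustration and not the code that was removed from BaseX: the EXTRACTORS registry, the extractor decorator, the element names and the line-counting stand-in for a real extractor are all assumptions made for this example.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical registry: file suffix -> function returning metadata facts.
EXTRACTORS = {}


def extractor(suffix):
    """Register a function as the extractor for one file suffix."""
    def register(func):
        EXTRACTORS[suffix] = func
        return func
    return register


@extractor("txt")
def extract_text(path):
    # Stand-in for a real extractor (ID3 tags, PDF full text, ...).
    with open(path, encoding="utf-8", errors="replace") as handle:
        return {"Lines": str(sum(1 for _ in handle))}


def map_file(path):
    """Build a <file> element and let a matching extractor add facts."""
    name = os.path.basename(path)
    suffix = os.path.splitext(name)[1].lstrip(".").lower()
    elem = ET.Element("file", name=name, size=str(os.path.getsize(path)))
    extract = EXTRACTORS.get(suffix)
    if extract is not None:
        for key, value in extract(path).items():
            ET.SubElement(elem, "fact", name=key).text = value
    return elem


def map_tree(path):
    """Walk a hierarchy depth-first and return its <dir>/<file> mapping."""
    if not os.path.isdir(path):
        return map_file(path)
    elem = ET.Element("dir", name=os.path.basename(os.path.normpath(path)))
    for entry in sorted(os.listdir(path)):
        elem.append(map_tree(os.path.join(path, entry)))
    return elem
```

Files without a registered extractor still get a plain entry, so the mapping stays complete even when no metadata library is available for a given type.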

If you just wish to produce an FSML mapping, XQuery with the EXPath File Module [1] can pretty much do the job:

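declare namespace fs = "http://basex.org/fs";

(: Recursively maps a file hierarchy to <dir/> and <file/> elements. :)
declare function fs:parse($path as xs:string) as element() {
  (: strip everything up to the last path separator ("\" or "/") :)
  let $name := replace($path, ".*[\\/]", "")
  return
    if (file:is-directory($path)) then
      <dir name="{ $name }">{
        (: file:list returns names relative to $path, so rebuild the full path :)
        for $f in file:list($path)
        return fs:parse(concat($path, "/", $f))
      }</dir>
    else
      <file name="{ $name }" size="{ file:size($path) }"/>
};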

If you also want to have the extractor functionality ... we thought about packaging [2] it for BaseX and making it available as XQuery functions. Just give us a hint and we will get going.

Thanks, Alex

[1] http://docs.basex.org/wiki/File_Functions
[2] http://docs.basex.org/wiki/Packaging

John D. Mitchell

2:22 p.m.

On Nov 14, 2011, at 11:17, Alexander Holupirek wrote: [...]

If you also want to have the extractor functionality ... we thought about packaging [2] it for BaseX and make it available as XQuery functions. Just give us a hint and we will get going.

++

Cheers, John

Andy Bunce

3:48 p.m.

It is the metadata extraction part that is non-trivial, so packaging the libraries and the calls for that sounds like a great way to go.

/Andy


Alexander Holupirek

15 Nov 2011

9:48 a.m.


Thanks for your feedback. We decided to go for the packaging approach and to provide an EXPath package [0] that produces an FSML database of a given file hierarchy.

It would be interesting to hear which file types are relevant to you. The idea is to have transducer code [1] that, for example, extracts ID3 information for audio files:

<file name="LockerBleiben.mp3" suffix="mp3" st_mode="0100644" st_size="4585915"
      st_mtime="1320945388000" st_uid="1000" st_gid="1000" st_nlink="1"
      bsid="70622d84-f4f7-4b90-95e2-9e1821e8d283">
  <folder name="ID3v2">
    <fact name="Title">Locker Bleiben</fact>
    <fact name="Artist">Die Fantastischen Vier</fact>
    <fact name="Composer">Andreas Rieke/Michael DJ Beck/Thomas Dürr/Michael B. Schmidt</fact>
    <fact name="Album">Lauschgift</fact>
    <fact name="Track">15/20</fact>
    <fact name="PartOfSet">1/1</fact>
    <fact name="Year">1995</fact>
    <fact name="Genre">Hip Hop/Rap</fact>
    <fact name="Compilation">1</fact>
    <fact name="Comment">(iTunPGAP) 0</fact>
    <fact name="EncodedBy">iTunes 8.0.2</fact>
  </folder>
  <folder name="Cover"> ... </folder>
</file>
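To give an idea of how such a record could be consumed outside of XQuery, here is a small Python sketch that reads the facts of one folder into a dictionary. The embedded record is a shortened copy of the example above, and the facts helper is a name invented for this illustration, not part of any BaseX package.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example record; attributes and facts are a subset.
FSML = """
<file name="LockerBleiben.mp3" suffix="mp3" st_size="4585915">
  <folder name="ID3v2">
    <fact name="Title">Locker Bleiben</fact>
    <fact name="Artist">Die Fantastischen Vier</fact>
    <fact name="Album">Lauschgift</fact>
    <fact name="Year">1995</fact>
  </folder>
</file>
"""


def facts(file_elem, folder):
    """Return the facts of one named folder as a plain dictionary."""
    path = f"folder[@name='{folder}']/fact"
    return {f.get("name"): f.text for f in file_elem.findall(path)}


record = ET.fromstring(FSML)
id3 = facts(record, "ID3v2")
print(record.get("name"), id3["Artist"], id3["Year"])
```

A folder that is missing from the record simply yields an empty dictionary, which keeps callers free of special cases for file types without an extractor.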

I am currently thinking about using exiftool [2] by Phil Harvey to include metadata for numerous multimedia file types, and about extracting full text and publisher metadata from PDF files, etc.

If you have something special or want to comment on this, I'm all ears.

Thanks, Alex

[0] EXPath Packaging: http://docs.basex.org/wiki/Packaging
[1] "Transducer" was coined by Gifford et al., Semantic File Systems: http://dl.acm.org/citation.cfm?id=121138
[2] exiftool: http://www.sno.phy.queensu.ca/~phil/exiftool/

Andy Bunce

6:34 p.m.

I am mainly interested in image (usually jpg) and audio (usually mp3) files. I don't know much about Exiftool, but it seems to be a Perl library. Nothing wrong with that :-), but it sounds like a heavy choice to wrap in a Java package?

xmlcalabash has a cx:metadata-extractor extension step; for images it is a thin shell around Drew Noakes' library of the same name (http://www.drewnoakes.com/code/exif/). It is mentioned at http://xmlcalabash.com/download/

Mp3 is more tricky, but https://github.com/mpatric/mp3agic looks like a possible candidate to me.

/Andy
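One way around wrapping Perl code directly might be to call the exiftool command line program and read its JSON output. The sketch below assumes exiftool is installed and on the PATH; the two function names are invented for this example, and the sample record only mimics the shape of the output with made-up values.

```python
import json
import subprocess


def parse_exiftool_json(output):
    """Turn exiftool's JSON output into {source file: {tag: value}}."""
    records = json.loads(output)
    return {
        rec["SourceFile"]: {k: v for k, v in rec.items() if k != "SourceFile"}
        for rec in records
    }


def read_metadata(paths):
    """Run the exiftool command line tool (it must be on the PATH)."""
    result = subprocess.run(
        ["exiftool", "-json", *paths],
        capture_output=True, text=True, check=True,
    )
    return parse_exiftool_json(result.stdout)


# Sample in the shape exiftool prints; the values are made up.
SAMPLE = '[{"SourceFile": "a.jpg", "FileType": "JPEG", "ImageWidth": 640}]'
print(parse_exiftool_json(SAMPLE))
```

Keeping the parsing separate from the process call means the mapping logic can be exercised without exiftool being installed.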


Andy Bunce

13 Dec

7:24 a.m.

Or Apache Tika, which covers a lot of ground: http://tika.apache.org/1.0/formats.html#Supported_Document_Formats


Johannes.Lichtenberger

14 Nov 2011

3:03 p.m.


Hi Alex,

BTW: I would maybe provide a switch or something similar to disable the traversal of hidden files (for instance .svn folders :-)). For my own requirements it seems to be sufficient to ignore hidden files altogether. Hopefully my little changes are sufficient ;-)

import os

# Signature inferred from the recursive call; the visit_* callbacks are
# defined elsewhere in the script.
def descend(dpath, depth, xdcr_map):
    # Drop hidden entries (names starting with '.') up front instead of
    # testing dents[0], whose position in the listing is arbitrary.
    dents = [d for d in os.listdir(dpath) if not d.startswith('.')]
    if depth != 0:
        if not dents:
            visit_empty_directory(dpath, depth)
            return
        visit_enter_directory(dpath, depth)
    for file_ in dents:
        path = os.path.join(dpath, file_)
        if os.path.islink(path):
            visit_link(path, depth + 1)
        elif os.path.isdir(path):
            descend(path, depth + 1, xdcr_map)
        else:
            visit_file(path, depth + 1, xdcr_map)
    if depth != 0:
        visit_leave_directory(depth)

kind regards, Johannes

Kevin S. Clarke

4:13 p.m.

Haven't really been following the conversation so if this is already there ignore me, but...

... or, a regexp pattern of files/directories to ignore/include.

Kevin
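Both suggestions, a switch for hidden files and regular expressions for what to ignore or include, could be combined roughly as in the following Python sketch. The function name walk_files and its skip_hidden, include and exclude parameters are hypothetical and only serve this illustration; they are not options of BaseX or of the script quoted above.

```python
import os
import re


def walk_files(root, skip_hidden=True, include=None, exclude=None):
    """Yield file paths below root, honouring the optional filters.

    include and exclude are regular expressions matched against the
    path relative to root, written with forward slashes.
    """
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    for dirpath, dirnames, filenames in os.walk(root):
        if skip_hidden:
            # Pruning dirnames in place stops os.walk from descending.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")
            if exclude_re and exclude_re.search(rel):
                continue
            if include_re and not include_re.search(rel):
                continue
            yield rel
```

For example, list(walk_files(".", exclude=r"\.pyc$")) would list everything except hidden entries and compiled Python files. Exclusion is checked before inclusion here, so a path matching both patterns is dropped; that ordering is a design choice of the sketch.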
