Or Apache Tika covers a lot of ground...
 http://tika.apache.org/1.0/formats.html#Supported_Document_Formats

On Tue, Nov 15, 2011 at 11:34 PM, Andy Bunce <bunce.andy@gmail.com> wrote:
I am mainly interested in image (usually JPG) and audio (usually MP3) files.
I don't know much about ExifTool, but it seems to be a Perl library. Nothing wrong with that :-), but it sounds like a heavy choice to wrap in a Java package?
  
XML Calabash has a cx:metadata-extractor extension step; for images it is a thin shell around Drew Noakes' library of the same name. It is mentioned at http://xmlcalabash.com/download/

MP3 is trickier, but https://github.com/mpatric/mp3agic looks like a possible candidate to me.

/Andy


On Tue, Nov 15, 2011 at 2:48 PM, Alexander Holupirek <alexander.holupirek@uni-konstanz.de> wrote:

On 14.11.2011, at 21:48, Andy Bunce wrote:

> It is the metadata extraction part that is non-trivial.
> So packaging the libraries and calls for that sounds like a great way to go.
>
> /Andy
>
> On Mon, Nov 14, 2011 at 7:22 PM, John D. Mitchell <jdmitchell@gmail.com> wrote:
> On Nov 14, 2011, at 11:17 , Alexander Holupirek wrote:
> [...]
> > If you also want to have the extractor functionality ... we thought about packaging [2] it for BaseX and making it available as XQuery functions.  Just give us a hint and we will get going.
>
> ++
>
> Cheers,
> John

Thanks for your feedback.  We decided to go for the packaging approach and to provide an EXPath package [0] in order to produce an FSML database of a given file hierarchy.

It would be interesting to hear which file types are relevant for you.
The idea is to have transducer code [1] that, for example, extracts ID3 information for audio files:

  <file name="LockerBleiben.mp3" suffix="mp3" st_mode="0100644" st_size="4585915" st_mtime="1320945388000" st_uid="1000" st_gid="1000" st_nlink="1" bsid="70622d84-f4f7-4b90-95e2-9e1821e8d283">
    <folder name="ID3v2">
      <fact name="Title">Locker Bleiben</fact>
      <fact name="Artist">Die Fantastischen Vier</fact>
      <fact name="Composer">Andreas Rieke/Michael DJ Beck/Thomas Dürr/Michael B. Schmidt</fact>
      <fact name="Album">Lauschgift</fact>
      <fact name="Track">15/20</fact>
      <fact name="PartOfSet">1/1</fact>
      <fact name="Year">1995</fact>
      <fact name="Genre">Hip Hop/Rap</fact>
      <fact name="Compilation">1</fact>
      <fact name="Comment">(iTunPGAP) 0</fact>
      <fact name="EncodedBy">iTunes 8.0.2</fact>
    </folder>
    <folder name="Cover">
      ...
    </folder>
  </file>
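
To make the shape of such a transducer's output stage a bit more concrete, here is a minimal Java sketch that turns a map of already-extracted facts into the file/folder/fact structure shown above. It is only an illustration under assumptions: the class name FsmlSketch and the method toFsml are made up for this mail, only a few of the file attributes are emitted, and the facts are hard-coded instead of being read from a real ID3 parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not part of BaseX or of any existing package.
public class FsmlSketch {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders one file entry with a single metadata folder of facts. */
    static String toFsml(String name, long size, String folder, Map<String, String> facts) {
        int dot = name.lastIndexOf('.');
        String suffix = dot < 0 ? "" : name.substring(dot + 1);
        StringBuilder sb = new StringBuilder();
        sb.append("<file name=\"").append(escape(name))
          .append("\" suffix=\"").append(escape(suffix))
          .append("\" st_size=\"").append(size).append("\">\n");
        sb.append("  <folder name=\"").append(escape(folder)).append("\">\n");
        // One <fact> element per extracted key/value pair, in insertion order.
        for (Map.Entry<String, String> e : facts.entrySet()) {
            sb.append("    <fact name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</fact>\n");
        }
        sb.append("  </folder>\n</file>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: in a real transducer these values would come from an ID3 reader.
        Map<String, String> facts = new LinkedHashMap<>();
        facts.put("Title", "Locker Bleiben");
        facts.put("Album", "Lauschgift");
        System.out.print(toFsml("LockerBleiben.mp3", 4585915L, "ID3v2", facts));
    }
}
```

A real transducer would plug an extractor for the given file type in front of this and add the remaining stat attributes; the sketch only covers the serialization step.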

Currently I am thinking about using ExifTool [2] by Phil Harvey to include metadata about numerous multimedia files.
Other ideas are to extract full text and publisher metadata from PDF files, etc.

If you have something special or want to comment on this, I'm all ears.

Thanks,
       Alex


[0] EXPath Packaging: http://docs.basex.org/wiki/Packaging
[1] Transducer, a term coined by Gifford et al., Semantic File Systems: http://dl.acm.org/citation.cfm?id=121138
[2] ExifTool: http://www.sno.phy.queensu.ca/~phil/exiftool/