Hello!

Just wanted to report back that it works really well. It takes about twice as long as running the md5 command on the command line of my Mac: a 4.15 GB file takes around 20 seconds in BaseX compared to 10 seconds using the native command.

Not sure if this is a limitation in Java or if performance could be tweaked further. But at the moment it feels unimportant for our case. 

Thank you again for your swift reply and delivery!

Regards,
Johan Mörén


On Sun Jan 25 2015 at 1:56:21 PM Johan Mörén <johan.moren@gmail.com> wrote:
Great news, Christian. I'll try it out tomorrow at work!

/Johan

On Sun, Jan 25, 2015 at 1:22 PM, Christian Grün <christian.gruen@gmail.com> wrote:
Hi Johan,

A new snapshot is available [1]. In the course of rewriting the
hashing code, I further improved our streaming architecture [2, 3].

Your testing feedback is welcome,
Christian

[1] http://files.basex.org/releases/latest/
[2] https://github.com/BaseXdb/basex/commit/b39b7
[3] https://github.com/BaseXdb/basex/commit/28139



On Sat, Jan 24, 2015 at 8:39 PM, Christian Grün
<christian.gruen@gmail.com> wrote:
> Thanks, this makes it much easier. I'll probably go for this one:
>
> MessageDigest md = MessageDigest.getInstance(algo);
> try(InputStream is = ...) {
>   try(DigestInputStream dis = new DigestInputStream(is, md)) {
>     while(dis.read() != -1);
>   }
>   return md.digest();
> }
>
> Keeping you updated,
> Christian
>
>
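The loop quoted above calls the single-byte read() on the DigestInputStream. For comparison, here is a minimal sketch of a variant that reads into a buffer instead. The class name DigestStreamHash, the method name hash and the 8192-byte buffer size are assumptions made for illustration, not code from BaseX, and whether this changes the timings reported in this thread is untested.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of BaseX.
final class DigestStreamHash {
  /** Computes a digest of everything readable from the stream, then closes it. */
  static byte[] hash(String algo, InputStream is)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance(algo);
    byte[] buffer = new byte[8192]; // buffer size is an arbitrary choice
    try (DigestInputStream dis = new DigestInputStream(is, md)) {
      // Each bulk read updates the digest as a side effect.
      while (dis.read(buffer, 0, buffer.length) != -1) {
        // nothing to do here
      }
    }
    return md.digest();
  }
}
```

Only one buffer's worth of data is held in memory at a time, so the size of the input does not affect the size of the array that is allocated.
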
> On Sat, Jan 24, 2015 at 7:39 PM, Johan Mörén <johan.moren@gmail.com> wrote:
>> Hi Christian
>>
>> I think you can go with Java's implementation all the way, like this:
>>
>> MessageDigest md = MessageDigest.getInstance("MD5");
>> // Size: 700 MB
>> InputStream is = new FileInputStream("C:\\Temp\\Small\\Movie.mp4");
>>
>> byte [] buffer = new byte [blockSize];
>> int numRead;
>> do
>> {
>>  numRead = is.read(buffer);
>>  if (numRead > 0)
>>  {
>>   md.update(buffer, 0, numRead);
>>  }
>> } while (numRead != -1);
>>
>> byte[] digest = md.digest();
>>
>>
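For anyone who wants to try the snippet above on its own, here is a self-contained sketch of the same chunked approach. The class name StreamingMd5, its method names and the 8192-byte block size are assumptions chosen for illustration (the quoted code leaves blockSize undefined). The toHex helper is meant to mirror the lower-case hex string that the XQuery script in this thread builds.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of BaseX.
final class StreamingMd5 {
  /** Hashes everything readable from the stream, one block at a time. */
  static byte[] md5(InputStream is) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[8192]; // block size is an arbitrary choice
    int numRead;
    while ((numRead = is.read(buffer)) != -1) {
      md.update(buffer, 0, numRead);
    }
    return md.digest();
  }

  /** Opens the file, hashes it, and closes the stream again. */
  static byte[] md5(Path file) throws IOException, NoSuchAlgorithmException {
    try (InputStream is = Files.newInputStream(file)) {
      return md5(is);
    }
  }

  /** Converts a digest to a lower-case hex string. */
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }
}
```

A call such as StreamingMd5.toHex(StreamingMd5.md5(path)) then yields the checksum string for a given java.nio.file.Path, and the try-with-resources block makes sure the file stream is closed even if reading fails.
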
>> On Sat Jan 24 2015 at 6:49:18 PM Christian Grün <christian.gruen@gmail.com>
>> wrote:
>>>
>>> Hi Johan,
>>>
>>> looks like a useful feature! Currently, we use Java's default
>>> implementation for computing hashes [1]. If you want to help us, you
>>> could look out for existing Java source code for MD5 hashing, which we
>>> could then adopt in BaseX!
>>>
>>> Best,
>>> Christian
>>>
>>> [1]
>>> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/query/func/hash/HashFn.java
>>>
>>>
>>> On Sat, Jan 24, 2015 at 11:37 AM, Johan Mörén <johan.moren@gmail.com>
>>> wrote:
>>> > Hello!
>>> >
>>> > We have been using the hashing module to calculate MD5 checksums on
>>> > binary files successfully for a while. But last week we received our
>>> > first really large file (4.3 GB) and our script threw a
>>> >
>>> > java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>> >
>>> > We are currently using the 7.8 version of BaseX. I suspect that BaseX
>>> > materializes the stream returned by file:read-binary as a byte array
>>> > when we call the hash:md5 function.
>>> >
>>> > This is a snippet of our script where the problem arises:
>>> > ...
>>> > let $binary := file:read-binary($filePath)
>>> > let $checksum := lower-case(xs:string(xs:hexBinary(hash:md5($binary))))
>>> > ...
>>> >
>>> > I think a nice feature to add to BaseX could be either a new function in
>>> > the file module called file-checksum($algorithm) that calculates checksums
>>> > on files in a streaming fashion, or perhaps an option for the hashing
>>> > functions that indicates that you want them to use streaming.
>>> >
>>> > Regards,
>>> > Johan Mörén