Open geppi opened 3 years ago
@whyboris Time to increment the vha file version (the final object property) and rehash files from the old vha file version to the new one on rescan/file load?
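A rough sketch of that migration idea in TypeScript; the hub structure and helper names here are assumptions for illustration, not VHA's actual API:

```ts
// Hypothetical shape of a loaded hub; the real .vha structure may differ.
interface HubFile { path: string; hash: string; }
interface Hub { vhaVersion: number; files: HubFile[]; }

const NEW_VHA_VERSION = 3; // assumed: bumped together with the hash fix

declare function computeFixedHash(path: string): string;                // assumed helper
declare function remapMetadata(oldHash: string, newHash: string): void; // assumed helper

function rehashOnLoad(hub: Hub): void {
  if (hub.vhaVersion < NEW_VHA_VERSION) {
    for (const file of hub.files) {
      const newHash = computeFixedHash(file.path);
      remapMetadata(file.hash, newHash); // carry tags & screenshots over to the new hash
      file.hash = newHash;
    }
    hub.vhaVersion = NEW_VHA_VERSION;
  }
}
```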
Same issue as #643
Oops, sorry for the duplication. I should have searched the existing issues before posting. @RedAero I second your finding that the hash calculation is very entangled. I'm more fluent in Python too, but the update method of the Hash object can certainly consume the buffer directly, without the string conversion.
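For reference, crypto's Hash.update accepts a Buffer as-is, so no string conversion is needed (the algorithm name here is just an example):

```ts
import { createHash } from 'crypto';

const someBuffer = Buffer.from('example bytes');

const hash = createHash('md5'); // 'md5' is only an example algorithm
hash.update(someBuffer);        // a Buffer can be consumed directly,
                                // no .toString() round-trip required
console.log(hash.digest('hex'));
```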
Thank you @geppi for your contribution 🙇 I'm finally planning to address this problem with #708 😅 I'll think about what to do with hash mismatches 😓
⚠️ this is still not fixed even though I merged in the PR updating the code. I still want to figure out how to gracefully transition from the code that has an error to one that doesn't - so the user's hubs don't lose 50% of their tags and screenshots 😓
Code in question: https://github.com/whyboris/Video-Hub-App/pull/708/files#diff-84a1283509603799dadbc0d1e879272e0cd9196aa1bd092948cbdf07ba60eca7R400
There is a bug lurking in the code that calculates the file hashes. The nodejs version currently used by VHA/Electron hides it, but the latest nodejs version, 16.2.0, would reveal it. The problematic line of code is line 400 in ./node/main-support.ts:
```ts
fs.read(fd, data, sampleSize, sampleSize, fileSize / 2, (err, bytesRead, buffer) => {
```
which intends to load a sampleSize chunk of data from the middle of the file. This works for even fileSize values, but for odd values the division returns a float, while fs.read expects an integer for its position parameter.
How this is handled depends on the nodejs version. From version 12.14.1 up to version 15.6.0, the code in fs.js silently replaces the float with -1, which results in reading the data from the current file position. From version 15.7.0 onward, an exception is thrown if a float is passed as the position parameter.
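A minimal repro sketch of the difference (the file name and sampleSize are placeholders, not VHA's actual values):

```ts
import * as fs from 'fs';

const sampleSize = 16 * 1024;            // placeholder; VHA's real chunk size may differ
const fd = fs.openSync('clip.mp4', 'r'); // placeholder file
const fileSize = fs.fstatSync(fd).size;
const data = Buffer.alloc(sampleSize * 2);

// For an odd fileSize, fileSize / 2 is not an integer, e.g. 1001 / 2 === 500.5.
// nodejs 12.14.1 - 15.6.0: fs.js silently replaces the float position with -1,
// so the read happens at the current file position instead of the middle.
// nodejs >= 15.7.0: fs.read throws because position must be an integer.
fs.read(fd, data, sampleSize, sampleSize, fileSize / 2, (err, bytesRead, buffer) => {
  // ...
});
```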
Unfortunately, in the current version of VHA, reading the data from the current file position means that the first sampleSize chunk is read again. So while for files with an even fileSize the data used in the hash calculation is:
|First sampleSize bytes| + |Middle sampleSize bytes| + |Last sampleSize bytes| + |fileSize number|
for files with an odd fileSize the calculation uses:
|First sampleSize bytes| + |First sampleSize bytes| + |Last sampleSize bytes| + |fileSize number|
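In code, the scheme described above looks roughly like this; the hash algorithm and function names are my assumptions, not the actual VHA code:

```ts
import { createHash } from 'crypto';

// firstChunk, middleChunk, lastChunk: Buffers read as described above.
function sampleHash(firstChunk: Buffer, middleChunk: Buffer,
                    lastChunk: Buffer, fileSize: number): string {
  return createHash('md5')   // algorithm is an assumption
    .update(firstChunk)
    .update(middleChunk)     // for odd fileSize this is accidentally the first chunk again
    .update(lastChunk)
    .update(String(fileSize))
    .digest('hex');
}
```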
The question is: how do we fix this?
The easy way would be to make the current behavior explicit in the code, i.e. check whether fileSize is odd and, if so, explicitly duplicate the sampleSize chunk from the beginning of the file, as sketched below. This would keep the hash values identical to their current values.
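A sketch of that option, reusing the placeholder names from the snippet above:

```ts
// Freeze the historical behavior: for an odd fileSize, deliberately re-read the
// first chunk instead of the middle one, so existing hashes stay valid.
const position = fileSize % 2 === 0
  ? fileSize / 2   // even: genuine middle chunk, as today
  : 0;             // odd: re-read the first chunk on purpose
fs.read(fd, data, sampleSize, sampleSize, position, (err, bytesRead, buffer) => {
  // ...
});
```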
However, to me it feels like carving an initial flaw into stone. It would certainly increase the probability that the hashes of two different clips with odd fileSize collide, although this probability is admittedly still very small: the files would, for example, have to have the same fileSize plus pure black frames at the beginning and at the end with identical metadata. Nevertheless, this implementation would certainly not be what was intended, but it would be the least painful.
The proper fix, in my mind, would be to round the resulting float down to the next lower integer and use a sampleSize chunk from the middle of the file for odd fileSize clips as well. However, this would lead to different hashes for those files compared to the current version. Not a problem when deploying a fresh VHA, but definitely an issue if you want to upgrade an existing one.
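The corresponding one-line change would be something like:

```ts
// Round the middle position down so it is always an integer; odd-fileSize
// clips now genuinely hash their middle chunk, changing their hash values.
fs.read(fd, data, sampleSize, sampleSize, Math.floor(fileSize / 2), (err, bytesRead, buffer) => {
  // ...
});
```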
Maybe your work on updating VHA to Electron 12 would be a good opportunity to fix this, although the nodejs version it will use, 14.16.0, still does not reveal the bug.