xandkar opened this issue 2 months ago
In the rare case of a hash collision, we could use the same collision-resolution strategy as a hash table:
Think of the destination as a collection rather than a singleton, then:
if collection is empty:
    add file to collection
else:
    compare file to existing members until first match or end of collection
    if something matched:
        skip adding file (i.e. we already have a copy of it)
    else:
        add file to collection
The collection could be implemented as a sequence number appended to the filename:
<date>--<time>--<digest>--<sequence>[.<extension>]
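For illustration, here is a rough sketch in Rust of how that probing could work. This is not phorg's actual code: `resolve_destination`, `files_equal`, and the simplified filename handling are assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Simplified equality check; fine for a sketch, but large files would be
/// better compared in a streaming fashion.
fn files_equal(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(fs::read(a)? == fs::read(b)?)
}

/// Treat every existing `<stem>--<seq>.<ext>` in `dir` as the collection for
/// this digest. Returns `Some(path)` if `src` should be added under `path`,
/// or `None` if an identical copy is already present.
fn resolve_destination(
    dir: &Path,
    src: &Path,
    stem: &str, // "<date>--<time>--<digest>"
    ext: &str,  // e.g. "jpg"
) -> io::Result<Option<PathBuf>> {
    let mut seq: u32 = 0;
    loop {
        let candidate = dir.join(format!("{stem}--{seq}.{ext}"));
        if !candidate.exists() {
            // Reached the end of the collection without a match: new file.
            return Ok(Some(candidate));
        }
        if files_equal(src, &candidate)? {
            // First match: we already have a copy of it, skip adding.
            return Ok(None);
        }
        // Genuine hash collision with a different file: keep probing.
        seq += 1;
    }
}
```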
This would let us confidently use fast hash functions, such as xxHash (#13).
The trouble is that the filename would then depend on the order in which phorg processed the files, so we still need a way to make it deterministic.
A deterministic alternative is to keep escalating the strength of the hash function until the collision disappears, appending each digest used to the final filename.
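A minimal sketch of that idea, assuming the `xxhash-rust` (with its `xxh3` feature) and `sha2` crates; the digest formatting is illustrative, and the bookkeeping needed to detect the collision and rename the already-placed file is not shown:

```rust
use sha2::{Digest, Sha256};
use xxhash_rust::xxh3::xxh3_64;

/// First level: fast but short digest (xxHash, as in #13).
fn digest_fast(bytes: &[u8]) -> String {
    format!("{:016x}", xxh3_64(bytes))
}

/// Second level: stronger digest, computed only when the fast digests of two
/// distinct files collide.
fn digest_strong(bytes: &[u8]) -> String {
    Sha256::digest(bytes)
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}

/// Digest portion of the filename. Because every file involved in a fast-hash
/// collision gets the stronger digest appended (including the one already in
/// place, which would need a rename), the name depends only on the file's
/// contents, never on the order in which the files were processed.
fn digest_component(bytes: &[u8], fast_digest_collided: bool) -> String {
    let fast = digest_fast(bytes);
    if fast_digest_collided {
        format!("{fast}--{}", digest_strong(bytes))
    } else {
        fast
    }
}
```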
Unique destination determination already performs deduplication, so the question is whether to remove one of the possibly-duplicate sources after at least one of them has been moved or copied to the destination.
This could be a CLI option.
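If it does become a CLI option, a hypothetical flag might look like this; the flag name and the use of `clap` are assumptions, not phorg's current interface:

```rust
use clap::Parser;

/// Hypothetical options for the dedup behaviour discussed above.
#[derive(Parser)]
struct Opts {
    /// After a file has been moved/copied (or found to already exist at its
    /// destination), remove the remaining duplicate source.
    #[arg(long)]
    remove_duplicate_sources: bool,
}

fn main() {
    let opts = Opts::parse();
    println!("remove duplicate sources: {}", opts.remove_duplicate_sources);
}
```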
Then the question is whether to remove a source if we skipped moving/copying it because the destination already exists.
For safety, this could include an additional file-equality check, maybe even a full byte-by-byte comparison rather than just hashing.
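As a sketch of that safety check (the function names and the `--remove-duplicate-sources` option are hypothetical, not phorg's actual API), a streaming byte-by-byte comparison could gate the removal:

```rust
use std::fs::{self, File};
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Full byte-by-byte comparison, streamed through buffered readers so large
/// files are never loaded into memory at once.
fn identical(a: &Path, b: &Path) -> io::Result<bool> {
    if fs::metadata(a)?.len() != fs::metadata(b)?.len() {
        return Ok(false); // different sizes can never be equal
    }
    let mut ra = BufReader::new(File::open(a)?).bytes();
    let mut rb = BufReader::new(File::open(b)?).bytes();
    loop {
        match (ra.next().transpose()?, rb.next().transpose()?) {
            (None, None) => return Ok(true), // both at EOF, every byte matched
            (Some(x), Some(y)) if x == y => continue,
            _ => return Ok(false),
        }
    }
}

/// Remove `src` only after verifying that `dst` really is an identical copy,
/// e.g. when a hypothetical `--remove-duplicate-sources` option is set.
fn remove_source_if_duplicate(src: &Path, dst: &Path) -> io::Result<bool> {
    if identical(src, dst)? {
        fs::remove_file(src)?;
        return Ok(true);
    }
    Ok(false)
}
```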