xandkar / phorg

Idempotent photo/video file organizer
GNU General Public License v3.0

Dedup #6

Open xandkar opened 2 months ago

xandkar commented 2 months ago

Determining a unique destination already deduplicates files at the destination, so the question is whether to remove the possibly duplicate sources once at least one of them has been moved or copied to the destination.

This could be a CLI option.

Then the question is whether to remove a source when we skipped moving or copying it because the destination already exists.

For safety, this could include an additional file-equality check, perhaps even a full byte-by-byte comparison rather than just a digest comparison.
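A minimal sketch of what that safety check might look like, in Python for illustration only. The function names `files_identical` and `remove_source_if_duplicate` are hypothetical and not part of phorg; the sketch assumes both paths are regular files.

```python
import os


def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Return True only if both files contain exactly the same bytes."""
    # Files of different sizes can never be equal, so avoid reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached end-of-file together
                return True


def remove_source_if_duplicate(src, dst):
    """Delete src only if dst exists, is a different file, and has equal bytes."""
    if not os.path.isfile(dst):
        return False
    # Guard against src and dst being the same file (e.g. via a hard link
    # or symlink); removing it would not be a dedup of anything.
    if os.path.samefile(src, dst):
        return False
    if files_identical(src, dst):
        os.remove(src)
        return True
    return False
```

The size check is only a shortcut; the decision to delete rests on the full byte comparison, never on the digest alone.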

xandkar commented 1 month ago

In the rare case of a hash collision, we could use the same strategy as a hash table uses for its buckets.

Think of the destination as a collection rather than a singleton, then:

if collection is empty:
  add file to collection
else:
  compare file to existing members until first match or end of collection
  if something matched:
    skip adding file (i.e. we already have a copy of it)
  else:
    add file to collection
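The pseudocode above could be sketched roughly as follows, in Python for illustration only. `add_to_collection`, `members`, and `new_member_path` are hypothetical names; how the members are found and how the new path is chosen is left to the caller.

```python
import filecmp
import shutil


def add_to_collection(src, members, new_member_path):
    """Copy src to new_member_path unless an identical member already exists.

    members: existing destination paths that share src's date, time and digest.
    Returns the path that holds src's content after the call.
    """
    for member in members:
        # shallow=False makes filecmp compare file contents rather than
        # trusting matching os.stat() signatures.
        if filecmp.cmp(src, member, shallow=False):
            return member  # we already have a copy, skip adding
    # No member matched (this also covers the empty collection).
    shutil.copy2(src, new_member_path)
    return new_member_path
```

The empty-collection branch of the pseudocode needs no special case here, because the loop simply does not run.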

The collection could be implemented as a sequence number appended to the filename:

<date>--<time>--<digest>--<sequence>[.<extension>]

This will let us confidently use fast hashing functions, such as xxHash (#13).
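To make the naming scheme concrete, here is a rough sketch in Python, for illustration only. The helper names are hypothetical, the date, time and digest are treated as opaque strings, and starting the sequence at 0 is an assumption, not something decided in this issue.

```python
import os


def destination_name(date, time, digest, sequence, extension=""):
    """Build <date>--<time>--<digest>--<sequence>[.<extension>]."""
    stem = "--".join([date, time, digest, str(sequence)])
    return f"{stem}.{extension}" if extension else stem


def collection_slots(directory, date, time, digest, extension=""):
    """Return (existing member paths, first free path) for one collection.

    Assumes sequence numbers are assigned contiguously starting at 0.
    """
    members = []
    sequence = 0
    while True:
        name = destination_name(date, time, digest, sequence, extension)
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            return members, path
        members.append(path)
        sequence += 1
```

The existing members are what an incoming file would be compared against, and the first free path is where it would go if none of them match.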

The trouble is that the filename will now depend on the order in which phorg processed the files, so we still need a way to make it deterministic.

A deterministic alternative is to keep escalating the strength of the hash function until no collision occurs, appending each digest used to the final filename.
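One possible reading of the escalation idea, sketched in Python for illustration only. The ladder below uses `hashlib` algorithms as stand-ins (xxHash is not in the Python standard library), and `HASH_LADDER`, `escalated_name`, and the `prefix` argument (standing for `<date>--<time>`) are all hypothetical.

```python
import hashlib
import os

# Hypothetical ladder, weakest first; stand-ins, not phorg's actual choices.
HASH_LADDER = ["md5", "sha1", "sha256", "sha512"]


def file_digest(path, algorithm):
    """Hex digest of a file's contents, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def same_bytes(path_a, path_b):
    """Whole-file comparison; reads both files fully, fine for a sketch."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read() == fb.read()


def escalated_name(src, directory, prefix, extension=""):
    """Return (name, already_present) for src inside directory.

    Each escalation appends one more digest to the name.
    """
    digests = []
    for algorithm in HASH_LADDER:
        digests.append(file_digest(src, algorithm))
        name = "--".join([prefix] + digests)
        if extension:
            name += "." + extension
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            return name, False  # free slot, no collision at this level
        if same_bytes(src, path):
            return name, True  # identical file already stored
        # Same name but different bytes: a real collision, so escalate.
    raise RuntimeError("collision at every level of the hash ladder")
```

In this sketch the file that arrives first keeps the shorter name and is never renamed; whether that is acceptable is a separate design question.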