worldveil / dejavu

Audio fingerprinting and recognition in Python
MIT License

Files are now uniquely identified through a SHA1 hash, as opposed to #79

Closed tjj5036 closed 9 years ago

tjj5036 commented 9 years ago

using the song name.

worldveil commented 9 years ago

Right, so the idea here is great. I just want to put it out there that since MP3 and other audio formats carry metadata tags (ID3, etc.), two files that are identical (mathematically/musically speaking) might still generate different hashes.

See discussion here.

However, I'll grant that for the purposes of fingerprinting and de-duping a library, there might still be some merit. What do you see as the benefits?
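
To make the metadata concern concrete, here is a minimal sketch using only `hashlib`. The byte strings are stand-ins rather than real MP3 data, and the `TAG:` prefix and the helper name `sha1_of_bytes` are made up for illustration. It only shows that a hash over the whole file changes when the tag bytes change, even though the audio payload is the same.

```python
import hashlib


def sha1_of_bytes(data: bytes) -> str:
    """Return the hex SHA1 digest of a byte string (hypothetical helper)."""
    return hashlib.sha1(data).hexdigest()


# Stand-in for the encoded audio; NOT real MP3 frames.
audio_payload = b"\x00\x01\x02\x03" * 1024

# Two "files" with the same payload but different fake tag headers.
file_a = b"TAG:title=Smells Like Teen Spirit;" + audio_payload
file_b = b"TAG:title=smells-like-teen-spirit (live);" + audio_payload

hash_a = sha1_of_bytes(file_a)
hash_b = sha1_of_bytes(file_b)

# Hashing the whole file: the differing tag bytes give different digests.
hashes_differ = hash_a != hash_b

# Hashing only the payload would give one digest for both files.
payload_hash = sha1_of_bytes(audio_payload)
```

Hashing only the audio payload would sidestep this, but that requires parsing each container format to locate the tags, which this sketch does not attempt.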

tjj5036 commented 9 years ago

Wow, I didn't see that issue, I just saw the PR. Originally I had seen the "todo" in the source, and since I'm doing something similar anyway it made sense to open a PR. The major advantage of using a hash or something shows up when you have a bunch of files with the same name; for instance, if you're fingerprinting live music and you're targeting a certain artist, you're going to end up with a ton of files that are something like "Smells-Like-Teen-Spirit.mp3".

Let me do a bit more research and possibly amend the PR / benchmark against identical files with different metadata.
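
As a rough illustration of the same-name case, the sketch below hashes each file's contents in blocks and keys an index by the digest instead of the file name. The helper names `file_sha1` and `index_by_hash`, the block size, and the directory layout are assumptions for this example, not code taken from the PR.

```python
import hashlib
import os
import tempfile


def file_sha1(path, block_size=1 << 20):
    """Hash a file's raw bytes in blocks so large files are not read at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def index_by_hash(paths):
    """Map each content hash to the list of paths that share it."""
    index = {}
    for path in paths:
        index.setdefault(file_sha1(path), []).append(path)
    return index


# Two different recordings saved under the same file name (placeholder bytes).
with tempfile.TemporaryDirectory() as root:
    paths = []
    for show, content in (("show_a", b"recording one"), ("show_b", b"recording two")):
        os.makedirs(os.path.join(root, show))
        path = os.path.join(root, show, "Smells-Like-Teen-Spirit.mp3")
        with open(path, "wb") as f:
            f.write(content)
        paths.append(path)

    index = index_by_hash(paths)
    names = {os.path.basename(p) for p in paths}
```

Keyed by name, the two recordings would collide on a single entry; keyed by content hash, each gets its own entry, and byte-identical duplicates would land in the same list.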

worldveil commented 9 years ago

will review this week - apologies for delay!