Closed: suxianbaozi closed this issue 10 years ago
I haven't actually tried it with that many songs, it could very well be.
Could you perhaps do some SQL profiling and see where the issue is? I'm somewhat surprised since the hash is indexed.
How many GB is your fingerprints table?
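One way to do the suggested SQL profiling is to confirm the index is actually used and time the hash lookup itself. A minimal sketch using SQLite for illustration (dejavu uses MySQL, where the equivalents are `EXPLAIN SELECT ...` and the slow query log; the table layout here is simplified):

```python
import sqlite3
import time

# Illustrative in-memory table; dejavu's real MySQL schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fingerprints (hash BLOB, song_id INTEGER, offset INTEGER)")
conn.execute("CREATE INDEX idx_hash ON fingerprints (hash)")
conn.executemany(
    "INSERT INTO fingerprints VALUES (?, ?, ?)",
    [(bytes([i % 256]), i % 400, i) for i in range(100000)],
)

# Check that the lookup uses the index, not a full table scan
# (MySQL equivalent: EXPLAIN SELECT ... WHERE hash = ...).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT song_id, offset FROM fingerprints WHERE hash = ?",
    (bytes([7]),),
).fetchall()
print(plan)  # the plan text should mention idx_hash

# Time the actual hash-only lookup.
start = time.perf_counter()
rows = conn.execute(
    "SELECT song_id, offset FROM fingerprints WHERE hash = ?", (bytes([7]),)
).fetchall()
print(len(rows), "rows in", time.perf_counter() - start, "seconds")
```

If the query itself is fast but thousands of such lookups each return hundreds of rows, the time goes into fetching and processing the rows, not into the index scan.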
I'm sorry, I have checked and the SQL query itself is fast, but a 5-second audio clip produces about 4000 fingerprints, and those fetch about 2,000,000 rows from MySQL; processing all of them takes a long time. Also, after #26 was fixed, the hash is no longer the primary key, which made the data much bigger. I hope you can follow my poor English, ^_^
While reading the code, I found that the fingerprints are hashed from two points. I think this produces many repeated fingerprints. Could I choose three points instead, so that the fingerprints are more dispersed and recognition fetches fewer rows from MySQL?
The fingerprints table has a UNIQUE constraint, so there should never be a duplicate (hash, offset, song_id) tuple, and thus no repeated fingerprints. The hash itself, yes, there will be many repeats of that; this is important, as songs often repeat themselves.
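The constraint described above can be sketched like this (SQLite in-memory for illustration; dejavu's actual MySQL schema may differ in types and column order):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE fingerprints (
        hash    BLOB    NOT NULL,
        song_id INTEGER NOT NULL,
        offset  INTEGER NOT NULL,
        UNIQUE (hash, song_id, offset)  -- no duplicate tuples
    )
    """
)
conn.execute("INSERT INTO fingerprints VALUES (?, ?, ?)", (b"\xab\xcd", 1, 100))
# Same hash in a different song: allowed, hashes legitimately repeat across songs.
conn.execute("INSERT INTO fingerprints VALUES (?, ?, ?)", (b"\xab\xcd", 2, 100))
# Exact duplicate tuple: rejected by the constraint.
try:
    conn.execute("INSERT INTO fingerprints VALUES (?, ?, ?)", (b"\xab\xcd", 1, 100))
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

count = conn.execute("SELECT COUNT(*) FROM fingerprints").fetchone()[0]
print(duplicate_allowed, count)  # False 2
```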
Different songs share the same hash. When recognizing, the query uses only the hash as its condition, so it fetches far too many rows (about 2,000,000), which takes a long time with 400 or more songs.
The query can't use the song_id or the true offset; they are unknown.
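That is exactly why the lookup is hash-only: the recognizer only learns song_id and the stored offset from the database, then aligns matches by offset difference afterwards. A minimal sketch of that matching step (simplified; the data and names here are illustrative, not dejavu's actual code):

```python
from collections import Counter

# What a hash-only SQL lookup returns: hash -> list of (song_id, stored_offset).
# Values are made up for illustration.
db = {
    "h1": [(1, 10), (2, 50)],
    "h2": [(1, 15), (2, 300)],
    "h3": [(1, 20), (3, 7)],
}

# Fingerprints of the unknown sample: (hash, offset_within_sample).
sample = [("h1", 0), ("h2", 5), ("h3", 10)]

# For the true match, stored_offset - sample_offset is constant,
# so the winning (song_id, difference) pair gets the most votes.
diffs = Counter()
for h, sample_offset in sample:
    for song_id, stored_offset in db.get(h, []):
        diffs[(song_id, stored_offset - sample_offset)] += 1

best, votes = diffs.most_common(1)[0]
print(best, votes)  # (1, 10) 3 -- song 1, aligned at offset 10, 3 votes
```

Since song_id and the alignment only emerge from this vote, neither can appear in the WHERE clause; the cost of the query is therefore driven entirely by how many rows each hash matches.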
Dejavu's constants in fingerprint.py aren't magic. They are tunable parameters that control how many hashes are made and how they are made.
You can try lowering DEFAULT_FAN_VALUE, and perhaps the overlap ratio too. A larger PEAK_NEIGHBORHOOD_SIZE may also produce fewer fingerprints, though possibly at the cost of accuracy. A larger DEFAULT_WINDOW_SIZE will create more frequency bins, and thus likely fewer collisions.
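To get a feel for how the fan value trades off, here is a rough count of how many hashes a given number of peaks can produce (a simplification of the pairing loop, assuming each peak is paired with the next fan_value - 1 peaks; real counts also depend on dejavu's time-delta filtering):

```python
def hash_count(num_peaks, fan_value):
    # Each peak i pairs with up to fan_value - 1 later peaks
    # (peaks i+1 .. i+fan_value-1), clipped at the end of the list.
    return sum(min(fan_value - 1, num_peaks - i - 1) for i in range(num_peaks))

for fan in (15, 5, 3):
    print(fan, hash_count(1000, fan))
```

Lowering the fan value shrinks the fingerprint count roughly linearly, which directly reduces both the table size and the number of rows each recognition fetches.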
The problem you are seeing is not too many hashes, but specifically too many hashes shared between songs.
Experiment. See what works for you. And if you find good parameters for your particular use case and corpus size, do let us know.
Thank you very much for your reply, I'll try.
for i in range(len(peaks)):
    for j in range(fan_value):
I found this code in fingerprint.py. When j equals zero, the hash is made from the same two points, which creates many repeated hashes. Is this a bug? I have now changed fan_value to 3 and made j start from 1, which works very well!
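The fix described above, as a runnable sketch (a simplified version of the pairing loop; starting j at 1 means a peak is never paired with itself):

```python
def pair_peaks(peaks, fan_value):
    """Pair each peak with up to fan_value - 1 later peaks."""
    pairs = []
    for i in range(len(peaks)):
        for j in range(1, fan_value):  # j starts at 1: no self-pairing
            if i + j < len(peaks):
                pairs.append((peaks[i], peaks[i + j]))
    return pairs

# peaks as (time, frequency) tuples
peaks = [(0, 100), (1, 200), (2, 150), (3, 300)]
pairs = pair_peaks(peaks, fan_value=3)
print(pairs)
# Every pair has distinct anchor and target points.
assert all(a != b for a, b in pairs)
```

With `range(fan_value)` instead, j = 0 would hash each peak against itself, adding one degenerate, highly collision-prone hash per peak.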
When the number of songs grows past 500, the query becomes very slow, taking more than several minutes.