worldveil / dejavu

Audio fingerprinting and recognition in Python

When the number of songs exceeds 500, queries become very slow #30

Closed suxianbaozi closed 10 years ago

suxianbaozi commented 10 years ago

When the number of songs grows beyond 500, a query becomes very slow, taking several minutes or more.

worldveil commented 10 years ago

I haven't actually tried it with that many songs; it could very well be slow.

Could you perhaps do some SQL profiling and see where the issue is? I'm somewhat surprised since the hash is indexed.

How many GB is your fingerprints table?
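If it helps, here is roughly how to check both. This is a sketch, not dejavu code: pymysql and the connection values are assumptions (dejavu itself uses MySQLdb), and the query only mirrors the shape of the hash lookup.

```python
import pymysql  # assumption: any MySQL client works here

# Sketch: check whether the hash lookup uses the index, and how big the
# fingerprints table is. Connection values are placeholders.
conn = pymysql.connect(host="localhost", user="root",
                       password="", database="dejavu")
with conn.cursor() as cur:
    # The matcher can only filter on hash (song_id/offset are unknown),
    # so EXPLAIN a lookup of that shape. The 20 hex chars stand in for
    # one fingerprint (stored as BINARY(10)).
    cur.execute("EXPLAIN SELECT song_id, offset FROM fingerprints "
                "WHERE hash = UNHEX(%s)", ("0123456789abcdef0123",))
    print(cur.fetchall())

    # Total size (data + indexes) of the fingerprints table, in GB.
    cur.execute(
        "SELECT ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) "
        "FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name = 'fingerprints'",
        ("dejavu",))
    print("fingerprints table, GB:", cur.fetchone()[0])
```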

suxianbaozi commented 10 years ago

Sorry, I've checked and the SQL query itself is fast. But a 5-second audio clip has about 4,000 fingerprints, and those fetch about 2,000,000 rows from MySQL; computing over all of them takes a long time. Also, after the fix for #26, hash is no longer the primary key, which made the data much bigger. I hope you can understand my poor English. ^_^

suxianbaozi commented 10 years ago

When I read the code, I found that the fingerprints are hashed from two peaks. I think this produces many repeated fingerprints. Could I choose three points instead, so that the fingerprints are more dispersed and recognition fetches fewer rows from MySQL? Something like the sketch below.
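```python
import hashlib

# Sketch of the three-peak idea -- not dejavu's API. Hashing triples packs
# more bits into each hash, so it should collide across songs less often.
# Assumes peaks is a list of (freq, time) tuples as built in fingerprint.py;
# the function name and the fan_value default are made up for illustration.
def generate_triple_hashes(peaks, fan_value=5):
    for i in range(len(peaks)):
        for j in range(1, fan_value):
            for k in range(j + 1, fan_value + 1):
                if i + k >= len(peaks):
                    break
                f1, t1 = peaks[i]
                f2, t2 = peaks[i + j]
                f3, t3 = peaks[i + k]
                # Three frequencies and two time deltas go into each hash;
                # the anchor time t1 is yielded for offset alignment, just
                # like dejavu's pair hashes.
                payload = "%d|%d|%d|%d|%d" % (f1, f2, f3, t2 - t1, t3 - t1)
                h = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:20]
                yield (h, t1)
```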

worldveil commented 10 years ago

The fingerprints table has a UNIQUE constraint (see the schema sketch below), so there should never be a duplicate (hash, offset, song_id) tuple, and thus no repeated fingerprints. The hash itself, yes, there will be many repeats of that; this is important, as songs often repeat themselves.
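```python
# Reconstructed from dejavu's database_sql.py -- a sketch, not the exact
# DDL; types and key names may differ between versions.
CREATE_FINGERPRINTS_TABLE = """
    CREATE TABLE IF NOT EXISTS fingerprints (
        hash BINARY(10) NOT NULL,
        song_id MEDIUMINT UNSIGNED NOT NULL,
        offset INT UNSIGNED NOT NULL,
        INDEX (hash),
        UNIQUE KEY unique_constraint (song_id, offset, hash)
    ) ENGINE=INNODB;
"""
```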

suxianbaozi commented 10 years ago

Different songs share the same hash, and when recognizing, the query uses only the hash as its condition. That fetches too many rows (about 2,000,000), which takes a long time once there are 400 or more songs.

worldveil commented 10 years ago

The query can't use the song_id or the true offset; they are unknown at recognition time.

Dejavu's constants in fingerprint.py aren't magic. They are tunable parameters that control how many hashes are made and how they are made.

You can try lowering DEFAULT_FAN_VALUE, and perhaps the overlap ratio too. A larger PEAK_NEIGHBORHOOD_SIZE may also produce fewer fingerprints, though possibly at the cost of accuracy. A larger DEFAULT_WINDOW_SIZE will yield more frequency bins, and thus likely fewer collisions.
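Concretely, these are the knobs. The defaults shown are from the dejavu source of roughly this era and may have changed; verify against your checkout.

```python
# Tuning knobs in fingerprint.py (defaults as of this era of the source).
DEFAULT_WINDOW_SIZE = 4096     # FFT window; larger -> more frequency bins
                               # per hash -> likely fewer collisions
DEFAULT_OVERLAP_RATIO = 0.5    # lower -> fewer spectrogram frames -> fewer peaks
DEFAULT_FAN_VALUE = 15         # peaks paired per anchor peak; lowering it
                               # cuts the hash count roughly linearly
PEAK_NEIGHBORHOOD_SIZE = 20    # larger -> only more prominent peaks survive,
                               # so fewer fingerprints, possibly less accuracy
```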

The problem you are seeing is not too many hashes, but specifically too many hashes shared between songs.

Experiment. See what works for you. And if you find good parameters for your particular use case and corpus size, do let us know.

suxianbaozi commented 10 years ago

Thank you very much for your reply. I'll try.

suxianbaozi commented 10 years ago

```python
for i in range(len(peaks)):
    for j in range(fan_value):
```

I found this code in fingerprint.py. When j equals zero, the hash is made from the same peak paired with itself, which causes many repeated hashes. Is this a bug? I have now changed fan_value to 3 and made j start from 1, which works very well!
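For anyone landing here later, here is roughly what that change looks like in context. The body of generate_hashes() is reconstructed from dejavu's fingerprint.py as a sketch; the module-level constants (DEFAULT_FAN_VALUE, IDX_FREQ_I, IDX_TIME_J, MIN_HASH_TIME_DELTA, MAX_HASH_TIME_DELTA, FINGERPRINT_REDUCTION) are assumed to be defined there and may differ by version.

```python
import hashlib

def generate_hashes(peaks, fan_value=DEFAULT_FAN_VALUE):
    for i in range(len(peaks)):
        # was: for j in range(fan_value) -- j == 0 paired each peak
        # with itself, yielding a degenerate, highly repeated hash.
        for j in range(1, fan_value):
            if i + j < len(peaks):
                freq1 = peaks[i][IDX_FREQ_I]
                freq2 = peaks[i + j][IDX_FREQ_I]
                t1 = peaks[i][IDX_TIME_J]
                t2 = peaks[i + j][IDX_TIME_J]
                t_delta = t2 - t1
                if MIN_HASH_TIME_DELTA <= t_delta <= MAX_HASH_TIME_DELTA:
                    payload = "%s|%s|%s" % (freq1, freq2, t_delta)
                    h = hashlib.sha1(payload.encode("utf-8"))
                    yield (h.hexdigest()[:FINGERPRINT_REDUCTION], t1)
```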